The concept of musical form refers to a variety of structural phenomena; most typically, it is used to designate the way musical pieces are organized at various hierarchical levels, such as the level of individual phrases, themes, sections, and movements (Caplin et al., 2009; Schoenberg, 1967). It has been pointed out that listeners may be trained to perceive musical forms (Smith, 1973). However, notwithstanding the recent revived interest in the theory of form (Formenlehre), the music psychology community still lags behind with regard to developing experimental methods to study the perception of individual constituents of the musical form (Sears et al., 2012).
Modeling musical form based on either audio or symbolic data is a challenging task which has been addressed by a number of music information retrieval (MIR) papers. In a study of 28 first movements from Beethoven’s piano sonatas in audio representation, Jiang and Müller (2013) use the similarity between the first and the last section (exposition and recapitulation) of movements in sonata form to model various aspects of the musical structure. Bigo et al. (2017) derive a Hidden Markov Model from features of melodic patterns, harmony, etc. extracted from symbolic scores to sketch sonata-form structure in selected string-quartet movements by Haydn and Mozart. Allegraud et al. (2019) combine expert annotations with machine learning algorithms to explore a Hidden Markov Model as a possible framework for modeling the typical sequence of sections in sonata-form movements in connection with thematic, harmonic and rhythmic features. Feisthauer et al. (2019) model structural features associated with the Medial Caesura (Hepokoski and Darcy, 2006), which divides a sonata exposition in two, to train a Long Short-Term Memory neural network to predict the occurrence of such caesuras in Mozart’s string-quartet movements. Weiß et al. (2020) use audio data of Beethoven’s piano sonatas to foster a dialogue between algorithm-derived local keys and formal subdivisions based on historical sonata theory. Shibata et al. (2020) use a hybrid model, involving elements of a Hidden Semi-Markov Model, to separate popular music into musically meaningful segments.
At variance with these and several other studies, we do not attempt to model the partitioning into sections across a piece’s formal trajectory; rather, our annotations involve assigning each piece a single form label at the entire movement’s level. In conjunction, we use modeling tools to separate data points representing individual movements by different form labels. While the most common form encountered in the corpus under analysis is the sonata form, we do not set out to model its constituents, but rather to distinguish sonata-form movements from pieces in other formal layouts. This is, to the best of our knowledge, a task not addressed by MIR studies to date. The aforementioned studies, for instance, proceed from a selection of pieces that adhere to a common underlying formal category, and consequently do not address the question of separation by form. We ultimately contemplate a possible prospective integration of our classification method as a first step in a projected MIR workflow, which would start by separating a given corpus of musical pieces according to form labels, subsequently subjecting each piece to formal analysis according to its form label as assigned in the first stage.
To be sure, questions of genre and style classification have been addressed by a considerable number of MIR papers (with relation to both symbolic and audio data), such as, for instance, Fucks (1962); Jiang et al. (2002); Lidy and Rauber (2005); Lidy et al. (2007); Panagakis et al. (2008); Shan and Kuo (2003); Tzanetakis et al. (2003); Tzanetakis and Cook (2002). However, our endeavor to separate pieces on form labels differs essentially from the tasks addressed in these previous studies.
In the present study, we investigate the viability of configuring an algorithm to assign form labels to pieces of the Classical repertoire. Our main working assumption is that the choice of musical form—particularly in the Classical repertoire—is closely related to the succession of local keys across a piece (see Section 2), and that local keys, in turn, are detectable through an analysis of the distribution of pitch classes across a whole piece or a segment thereof, as explained below.
As a main tool to model musical form, we apply pitch-class histograms extracted from symbolic representations of the works under investigation. Pitch-class distributions represent a broadly acknowledged and employed MIR technique, which has been successfully applied to both symbolic data and audio signals to accomplish a variety of MIR tasks, including modeling a piece’s style period (Fucks, 1962), automatic genre classification (Tzanetakis et al., 2003), key detection (Zhu and Kankanhalli, 2006), local key determination from audio data (Weiß et al., 2020), makam and rāga classification (Gedik and Bozkurt, 2010; Koduri et al., 2012), etc. As has been demonstrated in a number of studies on key profiles and key finding, music in a given key is characterized by a typical distribution of pitch classes, making key assignment possible on the basis of comparing actual pitch distributions with an empirically generated ideal distribution for a given major or minor key (key profile). The Krumhansl-Schmuckler key-finding algorithm, which was based on key profiles derived from experiments by Krumhansl and Kessler, in which subjects were asked to rate tonal fit (Krumhansl, 1990; Krumhansl and Kessler, 1982), has been improved upon in several subsequent studies to achieve greater accuracy (Albrecht and Shanahan, 2013; Shmulevich and Yli-Harja, 2000; Temperley, 1999), recent advances also deriving local keys directly from audio data (Schreiber et al., 2020).
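The key-profile comparison described above can be illustrated with a minimal sketch, assuming a plain twelve-pitch-class histogram and the Krumhansl and Kessler probe-tone ratings as commonly reproduced in the literature. The function names `pearson` and `estimate_key` are hypothetical, and the sketch shows only the basic correlation scheme, not the refined variants cited above and not the 26-spelling representation used in the present study.

```python
import math

# Krumhansl-Kessler probe-tone ratings, indexed by pitch class relative to the tonic
# (values as commonly reproduced in the literature).
KK_MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
KK_MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NAMES = ["C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B"]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def estimate_key(histogram):
    """Return (correlation, key name) for the best of the 24 candidate keys.

    `histogram` holds 12 duration weights, index 0 = pitch class C.
    """
    best = None
    for tonic in range(12):
        for mode, profile in (("major", KK_MAJOR), ("minor", KK_MINOR)):
            # Rotate the profile so that its tonic entry lands on `tonic`.
            rotated = [profile[(pc - tonic) % 12] for pc in range(12)]
            r = pearson(histogram, rotated)
            if best is None or r > best[0]:
                best = (r, f"{NAMES[tonic]} {mode}")
    return best
```

Applied to the histogram of a sliding window rather than of a whole piece, the same comparison yields a local key estimate for that window.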
While whole-piece histograms, that is, histograms that represent the distribution of pitch classes across an entire movement, retain at least some aspects of a piece’s tonal trajectory, note that one thing they do not reflect is the sequence or order of keys visited throughout a piece. One may proceed from the assumption that the temporal aspect may be better captured by using a set of histograms extracted from contiguous segments of a given piece and arrayed by their temporal order. This technique, referred to as “sliding-window pitch-class histograms,” has been applied in prior research work to detect local key changes (e.g., modulations) in the course of a piece (Shmulevich and Yli-Harja, 2000). The triangular “keyscapes,” conceptualized by Sapp (2001, 2005, 2011), and embraced by additional researchers (Lieck and Rohrmeier, 2020), represent multi-timescale visualizations used to simultaneously display local key estimations at various hierarchical levels. Finally, as shown by Quinn (2010), one may achieve excellent results on key detection also without relying on the customary analysis of pitch-class profiles, and with the window size drastically reduced, by defining the window to include merely a pair of adjacent sonorities that adhere to certain criteria.
Are pitch-class histograms also useful for effectively detecting musical form? Obviously, analysis by pitch-class distribution involves complexity reduction to a degree that does not allow for capturing, for instance, the occurrences of particular motives, local harmonic progressions, etc. What pitch-class histograms generally retain is information about a piece’s tonal structure. While whole-piece pitch-class histograms reflect a piece’s tonal trajectory only indirectly, it is expected that by using sliding-window histograms, a richer model of the sequence of keys may be obtained. As will be explained in detail in Section 2, musical form templates differ from one another, among other aspects, by the sequence of local keys. Thus, the sliding-window technique’s advantage in modeling a piece’s tonal trajectory is expected to bear on its potential success in detecting musical form as well.
In the present study we use both whole-piece and sliding-window histograms to perform musical form classification on 122 major-mode piano-sonata movements by W. A. Mozart and L. v. Beethoven. Pieces were provided as symbolic data (MusicXML format). We applied a variety of algorithms, namely Support Vector Machines (SVMs), Artificial Neural Networks (ANNs) and Gaussian Mixture Models (GMMs), to both whole-piece and sliding-window representations. The classification/clustering tasks involved three generic formal templates which are common among pieces of the Classical period: sonata form, ABA (or ternary) form, and variation form (see detailed explanation in Section 2). The supervised classification methods (SVMs and ANNs) achieved markedly better results using the sliding-window data. As for the GMM, we detected no compelling correlation between the clustering results and the investigated form categories; nonetheless, separation was still slightly better for the sliding-window data.
Contemplating our results, we address the possibility that the better success of the various algorithms when applied to sliding-window data may be linked to the dynamic behavior of particular pitch classes across a piece’s timeline, as reflected in a number of additional analyses (see Section 5.2).
In the remainder of this paper, we first discuss the ways the differences among musical forms are expected to manifest themselves in pitch-class histograms derived from movements in these forms (Section 2); we then describe our approach to classifying musical forms using both whole-piece and sliding-window histograms in combination with various machine learning algorithms (Section 3); finally, we describe and discuss our results and their possible implications (Sections 4 and 5).
Western art music of the common practice period (approx. 1600–1950) is characterized by a variety of movement forms, ranging, for instance, from simple strophic songs to huge symphonic poems. Different forms are characterized by such aspects as their repetition patterns (certain sections recurring at particular positions across the piece), tonal organization, number and character of musical themes, etc. In particular, specific formal layouts involve a typical tonal trajectory, that is, a succession of specific local keys (defined with regard to the main key) that occur at specific positions along the piece’s timeline. In this section we describe the standard tonal trajectories of the three generic formal templates addressed in our study, and contemplate the expected implications for modeling form on pitch-class histograms.
Sonata form, possibly the most influential formal layout from the eighteenth to the twentieth century, falls into three main sections known as exposition, development, and recapitulation (Caplin, 1998; Hepokoski and Darcy, 2006; Rosen, 1988; Schoenberg, 1967). In the exposition, after presenting the main theme in the home key, the music modulates to a secondary key to present the secondary theme, as well as additional closing material. In Classical sonata movements in the major mode, the secondary key is most typically the dominant key (symbolized “V”). The development section has no standard tonal scheme, and may visit an unpredictable number of local keys. Yet, many development sections begin in the dominant key and end with a return to the main key (“I”). The recapitulation section repeats much (or all) of the expositional material, but typically remains in the main key throughout (although local deviations to other keys are possible). A further closing section—“coda”—is optional and remains, as a rule, in the main key as well.
The Classical sonata form is chiefly associated with fast movements, typically found among the first movements of multi-movement cycles. While this sonata type (to which we refer as “sonata-allegro”) may be considered to represent the generic sonata, there are several sonata-form variants that are close enough to this generic type to be subsumed under a broader “sonata category.” Slow movements by Mozart and Beethoven are often in sonata form, but they typically evince a more concise design than sonata-allegro movements. Beethoven’s fast sonata movements are quite often preceded by a slow introduction which, necessarily, influences the distribution of tonal zones across the piece’s timeline with respect to the more customary design without introduction. In the so-called “sonatina” form, or “Type 1 sonata,” as in Hepokoski and Darcy (2006), there is no development (or just a short passage substituting for the development section). An additional category of sonata-related formal designs is that of the sonata-rondo, which combines structural principles of the rondo and the sonata form. In terms of its tonal trajectory, the sonata-rondo is linked to the generic sonata in that it features a modulation to the secondary key in its expositional episode, and a subsequent tonic-key re-invocation of secondary-key material in the recapitulatory one (Hepokoski and Darcy, 2006). In our various classification tasks, we subsume all particular variants of the generic sonata form under a broad sonata label.
Ternary (or ABA) form designs involve the concept of a main part (A) which is presented twice, with a different, often contrasting middle section (B) interpolated between these two presentations. Although there is an apparent similarity between the ABA layout and that of the sonata form, in that the latter has its development section interpolated between the exposition and the recapitulation, there are several crucial differences. Most importantly, as opposed to sonata expositions, which invariably close in the secondary key, the ABA form’s initial A section (even where not literally identical with the final A section) closes in the home key. Ternary designs are present, for instance, in some Classical slow movements, mostly in connection with a contrasting middle section in a different key. However, the most common Classical ABA forms are the minuet and the scherzo forms. The seventeenth- and eighteenth-century minuet was gradually replaced in the nineteenth century by the scherzo. Both movement types are characterized by a main minuet (or the scherzo’s main section) which is repeated in its entirety at the end of the movement, whereby the middle is occupied by an often contrasting “trio” section. The trio’s key more often than not differs from that of the main minuet, common options being the subdominant (“IV”), the parallel minor (“i”), and the relative minor (“vi”). Accordingly, minuet/scherzo movements typically (though not always) feature a global tonal contrast between the main minuet’s key and that of the trio.
Finally, variation movements have a quite unique formal-tonal layout. The theme presented at the beginning of such movements is typically in rounded binary form, which is closely related to, but more concise than, the sonata form. It is then followed by a chain of variations, each of which has the same length and tonal structure as the theme, and is (in the Classical period almost invariably) also in the same key. The only exception to this rule is a minor-mode variation in the parallel minor of the main key (in major-mode movements), which typically occurs about halfway into the movement (the exact location of the minor-mode variation may vary). The entire movement often closes with a “coda” prolonging the final variation.
As may be gleaned from Figure 1, which compares the three prototypical formal designs discussed above, the temporal organization of tonal areas (or keys) throughout a given movement is expected to differ substantially among the three forms. Whereas ABA movements have a symmetrical tonal design (the main key flanking the key of the trio, or B section, on both sides), sonata movements feature their central tonal contrast already in their first section (exposition), resolving it in their final section (recapitulation), with the middle section (development) consisting of an unpredictable succession of local keys. Finally, variation movements reproduce, time and again, the same tonal structure (centered, as a rule, around the main key), typically interrupting it through a minor-mode variation located around the movement’s middle. We expect the general differences of tonal organization among the three forms to be reflected in actual pieces’ histograms, in particular in histograms based on the sliding-window technique.
Having said that, it is important to note that not every single piece of the repertoire under analysis abides by these strict tonal trajectories. Particularly in Beethoven—often considered an early-Romantic composer, rather than a Classical one—there are notable deviations. For instance, the first movement of the Waldstein Sonata Op. 53—in every other sense a regular sonata-allegro movement—modulates to the major-mode key on the third scale degree as its secondary key, instead of the expected dominant key, while the variation movement from the last Piano Sonata Op. 111 lacks the customary minor-mode variation, etc.
Ultimately, given that a piece’s formal layout, by and large, manifests itself in its tonal trajectory, one may opt for modeling musical form on the succession of local keys throughout a given piece (rather than the distribution of pitch classes, as we propose to do here). In fact, a steadily growing body of expert harmonic analyses of Classical pieces makes it possible to extract ground-truth data on the succession of local keys across entire corpora of music as a preliminary step toward classification/clustering experiments. However, in our present contribution we opt for operating directly on the raw data (pitch-class distributions), while confining the use of expert annotations to the level of whole-piece data (a piece’s key and form label). In Section 5.2.2, we nonetheless contemplate the possibility of employing expert annotations of local keys as an avenue for further research.
We base our dataset on all piano-sonata movements by W. A. Mozart (totalling 54) and L. v. Beethoven (totalling 102). For this entire corpus we provide metadata, including a form label and a global key for each movement. A file containing our metadata is available on our GitHub page.
The metadata is based on expert annotations provided by the authors. Main-key information at the whole-piece level is common knowledge and entails no ambiguity with regard to the repertoire under investigation. While a more complete MIR workflow might be expected to include a function to extract key information algorithmically, the key-finding algorithms featured in the code library used in the present research did not yield results accurate enough to be integrated into the present workflow, not even for this small (and, tonally speaking, fairly conservative) corpus.
In our metadata, we assigned a total of twenty-one unique form labels to individual movements across the corpus, representing a high-resolution differentiation among formal subcategories. Fourteen of these twenty-one subcategories map onto the three main form categories discussed in the previous section: sonata, ABA, and variations.
Importantly, the dataset used in the present analysis comprises only the 122 movements in major-mode keys (48 by Mozart, 74 by Beethoven). Since pitch-class histograms are crucially influenced by a movement’s mode, and, additionally, certain forms (such as the sonata form) have different tonal trajectories in major and in minor, incorporating minor-mode movements in the corpus under analysis would interfere critically with the ability to perform separation on the basis of a movement’s form. In fact, the implications of the minor mode for a piece’s structure are so overriding that we proceed from the assumption that, given a corpus containing a sufficient number of minor-mode pieces, one would opt for devising separate sets of form labels for major- and minor-mode forms.
Of the major-mode pieces in the corpus under analysis, a total of 76 movements correspond to the generic “sonata” label, 20 are labeled “ABA,” and 7 are labeled “variations.” We include in our metadata yet another seven form labels, covering another nineteen movements, including fugues, rondos (not sonata-rondos!), free forms, etc. While all of these “miscellaneous” formal layouts are incompatible with our three generic labels, none of them amounts to a generic formal category in its own right. We accordingly choose to leave the corresponding movements—which represent 16% of the total 122 major-mode movements of our corpus—unlabeled in our analysis.
As a preliminary step, we converted all individual movements from their original MuseScore format to MusicXML. Then, prior to extracting pitch-class distributions from these latter files, we had to give some thought to the question of passages notated only once but played multiple times. In fact, all classical forms discussed in Section 2 normally incorporate literal repeats of entire sections designated by either repeat signs, or, in ABA forms, using the instruction Da capo (meaning going back to the beginning of the piece). However, in certain movements composers choose to omit the customary repeat signs, meaning that passages that would otherwise be repeated are heard only once. To establish a uniform treatment of all movements of a given form, we disregard repeat signs, and include each notated passage only once. On the other hand, we append Da capo repetitions, because such repetitions define the ternary form as such (note that in many movements in ABA form, the final “A” section is not written out but rather indicated using a Da capo sign). Finally, a particular type of repetition in conjunction with repeat signs involves additional bits of music known as “voltas.” Although our decision to ignore repeat signs also renders “voltas” redundant, we retain these bits, while omitting the literally repeated measures.
All movements participating in the analysis were transposed to C major, such that all interim keys were transposed accordingly, without wrapping of keys with many accidentals. Notably, some current approaches to working with symbolic data, as, for example, by Lieck and Rohrmeier (2020), do not require previous knowledge of a piece’s tonic key, and, accordingly, do not apply key-unifying measures, employing, instead, transposition-invariant metrics. However, in the present study we opt for bringing all pieces into line using a unifying transposition.
MusicXML data (unlike MIDI) distinguishes between different enharmonic spellings of a given pitch class. Note that enharmonically equivalent pitch spellings can mean musically very different things. For instance, while the pitch E♭ in C major is often tantamount to moving to the parallel minor (a procedure associated with specific formal locations), its enharmonic equivalent D♯ most often boils down to merely ornamental chromaticism and, as such, has few formal implications. Accordingly, in our modeling we use the richer features of enharmonic spellings supported by the MusicXML format.
As a result of our decision to distinguish between enharmonic spellings on the one hand, and unify datapoints through transposition to C major on the other, the corpus under analysis does not abide by the twelve pitch classes emblematic of MIDI data, but entails a larger number of “pitch spellings,” some of which are rather off-centered, using × and ♭♭ signs. The pitch spellings used in our data amount to a total of twenty-six (see Figure 2 for details). The extremely off-centered pitch spellings occur in conjunction with distant modulations in pieces whose principal keys are already quite distant from C major to begin with. For instance, the B♭♭ in our note-level representation of Beethoven’s Sonata Op. 90, 2nd movement originates from a local deviation to B♭ minor in an E major movement (incidentally, as can be seen in this figure, the occurrences of this pitch spelling throughout the corpus are quantitatively negligible). Intriguingly, the great majority of off-centered pitch spellings stem from works by Beethoven, reflecting Mozart’s use of a more conventional range of global keys and interim modulations.
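One way to picture how a spelling-preserving transposition to C major produces such off-centered spellings is to place every pitch spelling on the line of fifths and subtract the position of the movement's tonic. The following sketch is an illustration under that assumption only; the line-of-fifths encoding and the helper names (`spelling_to_fifths`, `fifths_to_spelling`, `transpose_to_c`) are hypothetical and are not taken from the workflow described here.

```python
# Line-of-fifths positions of the natural notes (C = 0); each sharp adds 7,
# each flat subtracts 7. This encoding is an illustrative assumption.
LETTER_FIFTHS = {"F": -1, "C": 0, "G": 1, "D": 2, "A": 3, "E": 4, "B": 5}
LETTERS_BY_FIFTHS = "FCGDAEB"


def spelling_to_fifths(name):
    """Map a spelling such as 'Db' or 'F##' to its line-of-fifths index."""
    letter, accidentals = name[0], name[1:]
    return LETTER_FIFTHS[letter] + 7 * (accidentals.count("#") - accidentals.count("b"))


def fifths_to_spelling(index):
    """Inverse of spelling_to_fifths, using '#' and 'b' for accidentals."""
    # Shift by one so that F maps to 0; divmod then separates the number of
    # accidentals from the letter position.
    accidentals, letter_idx = divmod(index + 1, 7)
    letter = LETTERS_BY_FIFTHS[letter_idx]
    return letter + ("#" * accidentals if accidentals >= 0 else "b" * -accidentals)


def transpose_to_c(name, major_tonic):
    """Transpose a spelling from the major key on `major_tonic` to C major."""
    return fifths_to_spelling(spelling_to_fifths(name) - spelling_to_fifths(major_tonic))
```

Under this encoding, for example, a notated D♭ in an E major movement becomes B♭♭ after transposition to C major (`transpose_to_c("Db", "E")`), because no enharmonic wrapping is applied.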
We used the music21 toolkit to extract whole-piece histograms and sliding-window histograms for each movement of the corpus. With regard to the sliding-window technique, this procedure consists of determining a window size (expressed as a percentage of the piece’s duration) and the desired number of windows, and computing the start and end positions for each window (note that, depending on the settings of these parameters, windows may partially overlap). Subsequently, for each window, we count all notes that fall (entirely or partially) within its time span, weighting each occurrence by its duration. For notes falling partially outside the window, we take only the portion within the window into account. Finally, the entire histogram derived from a given window is normalized, such that all entries corresponding to the various individual pitch spellings add up to 1. This normalized format ensures comparability across pieces of differing lengths as well as across windows embodying a different volume of note durations.
Each of the histograms thus derived is then treated as a vector of 26 real values in the range [0,1] (zero stands for pitch spellings absent from a given histogram). The histograms representing individual windows in a given piece are then concatenated according to their temporal order to form a vector of length 26 times the number of windows. Note that a whole-piece histogram representation of a given piece is simply a special case of the sliding-window one, where there is only a single window of size 100%.
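The extraction procedure described in the last two paragraphs can be sketched as follows. This is a minimal illustration, not the actual extraction code: it assumes that the notes have already been read from the score into `(onset, duration, spelling)` tuples, that the windows are evenly spaced from the beginning to the end of the piece, and that every spelling occurring in the notes is contained in the `spellings` vocabulary. All function names are hypothetical.

```python
def window_bounds(total, window_frac, n_windows):
    """Evenly spaced (start, end) pairs; the last window ends at `total`."""
    size = window_frac * total
    if n_windows == 1:
        return [(0.0, size)]
    step = (total - size) / (n_windows - 1)
    return [(i * step, i * step + size) for i in range(n_windows)]


def histogram(notes, start, end, spellings):
    """Duration-weighted, normalized histogram of one window."""
    weights = dict.fromkeys(spellings, 0.0)
    for onset, duration, spelling in notes:
        # Count only the portion of the note that lies inside the window.
        overlap = min(onset + duration, end) - max(onset, start)
        if overlap > 0:
            weights[spelling] += overlap
    total = sum(weights.values())
    # Normalize so that the entries add up to 1 (all zeros for an empty window).
    return [weights[s] / total if total else 0.0 for s in spellings]


def sliding_window_vector(notes, window_frac, n_windows, spellings):
    """Concatenate the window histograms in temporal order."""
    total = max(onset + duration for onset, duration, _ in notes)
    vector = []
    for start, end in window_bounds(total, window_frac, n_windows):
        vector.extend(histogram(notes, start, end, spellings))
    return vector
```

Calling `sliding_window_vector(notes, 1.0, 1, spellings)` yields the whole-piece histogram as the special case of a single window of size 100%.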
The choice of the size and number of sliding windows poses a crucial methodological decision for our analysis. Notably, for very short pieces, if the window size is too small, each window will correspond to a very short time span (for example, a single measure), such that the resulting histogram will be extremely sensitive to melodic detail, and may fail to reflect the piece’s large-scale tonal plan. Conversely, if the window size is too large, the resulting level of granularity may not enable us to detect tonal changes that demarcate important formal boundaries. It appears that a good size for a sliding window to sustain our form classification task would fall within a range of 10%–20% of the piece’s duration.
Another consideration involves the number of sliding windows, which is a corollary of the window size on the one hand and the desired amount of overlap between adjacent windows on the other. While overlap evidently entails potential redundancy, a certain amount of overlap is desirable, because a zero overlap makes it hard to detect tonal events occurring at the boundaries between windows. We consider an overlap of 50%–90% of the individual window’s size to represent a useful range for our analysis.
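The dependence of the number of windows on window size and overlap can be made explicit with a small calculation. The relation below is an assumption made for illustration (equal hops between consecutive windows, with the hop rounded so that the windows cover the whole piece); the text does not spell out the exact formula used. The function name `window_count` is hypothetical.

```python
from fractions import Fraction
from math import ceil


def window_count(window_pct, overlap_pct):
    """Number of windows needed to cover a piece.

    window_pct:  window size in percent of the piece's duration (0 < w <= 100)
    overlap_pct: overlap between adjacent windows in percent of the window
                 size (0 <= o < 100)
    """
    if not (0 < window_pct <= 100 and 0 <= overlap_pct < 100):
        raise ValueError("expected 0 < window_pct <= 100 and 0 <= overlap_pct < 100")
    # Hop between consecutive window starts, in percent of the piece.
    hop = Fraction(window_pct) * Fraction(100 - overlap_pct, 100)
    # One initial window plus as many hops as needed to reach the end.
    return ceil(Fraction(100 - window_pct) / hop) + 1
```

Under this assumed relation, a window size of 15% combined with an overlap of 90% gives 58 windows, which would be consistent with the 58 windows of size 15% used in the clustering experiment, although the overlap applied there is not stated explicitly.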
Having defined these two ranges (for the window size and for the overlap between adjacent windows), we experimented with a total of nine configurations based on different settings of these two parameters, and report the corresponding results in the supervised learning experiment. For the individual configurations, see Table 1.
We performed supervised classification on the piano-sonata movements under investigation according to the three main form categories described above. We compared the performances of two supervised models—Support Vector Machines (SVMs) and Artificial Neural Networks (ANNs).
To obtain a performance estimate that is not inflated by overfitting, despite the small amount of data, we employed Leave-One-Out cross-validation. This involves conducting as many tests as there are labeled items (103). For each iteration, the test set consists of a single item, while the training set consists of all the remaining 102 labeled items.
We perform this training-and-testing procedure both on the whole-piece data and the sliding-window data. With regard to the sliding-window data, after obtaining the feature vectors (as described in Section 3.2.2), we reduced the training set’s dimensionality to 26 using Principal Component Analysis (PCA) to match the number of dimensions of the whole-piece histogram representation (for which no dimension reduction has yet been performed at this stage of the analysis). By granting both these datasets an identical number of dimensions, we aim to provide uniform conditions for comparing the performance of our classification algorithm on both sets. After training the model and prior to classification, the feature vector of the test item was projected onto the same lower-dimensional space defined by the Principal Components derived from the training set, yielding a data point of the same dimensionality.
Given that some of the 26 entries in the whole-piece representation stand for extremely rare pitch spellings which appear only in a very small subset of the corpus (as shown in Figure 2), and are thus potentially redundant, we opted for a dimensionality reduction of the whole-piece data from 26 to 12. The dimensionality 12 was chosen in analogy to widespread MIDI-pitch based analyses (Lieck and Rohrmeier, 2020; Quinn and White, 2017; Sapp, 2005). Accordingly, we repeated our entire classification experiment twice: at 26 and at 12 dimensions. For the lower-dimensionality experiment, sliding-window data was reduced directly to 12.
For the Support Vector Machine (Vapnik, 1998), we used the SVM model available from scikit-learn with an RBF kernel, the default value for gamma, and tested various values for C—the regularization parameter—as described in the Results section.
In order to compare the performance of the SVM method with that of Artificial Neural Networks (Briot et al., 2020; Goodfellow et al., 2016), we repeated the entire experiment using ANNs. We employed scikit-learn’s MLPClassifier, using a hidden layer of size 8, the default ReLU activation function, and the L-BFGS solver, which is recommended for small datasets.
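The supervised procedure described in this section (Leave-One-Out cross-validation, PCA fitted on the training items only, and an SVM or ANN classifier) may be sketched with scikit-learn as follows. This is a minimal sketch rather than the code used in the study: the helper names `loo_accuracy`, `make_svm` and `make_ann` are hypothetical, C = 4 is the value mentioned in the Results section, and `max_iter` and `random_state` are assumptions added here for convergence and reproducibility.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import LeaveOneOut
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def make_svm():
    # RBF kernel with scikit-learn's default gamma.
    return SVC(kernel="rbf", C=4)


def make_ann():
    # One hidden layer of size 8, ReLU activation, L-BFGS solver.
    # max_iter and random_state are assumptions, not taken from the text.
    return MLPClassifier(hidden_layer_sizes=(8,), activation="relu",
                         solver="lbfgs", max_iter=1000, random_state=0)


def loo_accuracy(model_factory, X, y, n_components):
    """Leave-One-Out accuracy; PCA is re-fitted on every training fold."""
    X, y = np.asarray(X), np.asarray(y)
    hits = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        # The pipeline fits PCA on the training items and projects the
        # held-out item onto the same principal components.
        clf = make_pipeline(PCA(n_components=n_components), model_factory())
        clf.fit(X[train_idx], y[train_idx])
        hits += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
    return hits / len(y)
```

With `X` holding one concatenated sliding-window vector per labeled movement and `y` the form labels, `loo_accuracy(make_svm, X, y, 26)` would correspond to the 26-dimensional SVM setting, and `loo_accuracy(make_ann, X, y, 12)` to the 12-dimensional ANN setting.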
A Gaussian Mixture Model (GMM) (Reynolds, 2009) is an unsupervised learning method that treats data points as the emissions of naturally occurring probability distributions, and attempts to extrapolate the most likely probability distributions. By asserting that the data points are samples drawn from a certain number of distributions, or “components,” one may then proceed to estimate which data point originated from which component.
We used the GaussianMixture class of scikit-learn to model all the labeled movements in our corpus as drawn from three distinct components, expected to roughly correspond to the three generic form labels used in our analysis (sonata, ABA and variations). Prior to clustering, we used PCA to reduce dimensionality to three. The analysis was run on both whole-piece and sliding-window data, whereby the latter analysis was based on splitting each piece into 58 windows of size 15%. 100 iterations of the GMM clustering were performed on each of the two representations.
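A clustering run of this kind may be sketched as follows. The sketch assumes that the 100 iterations are independent fits with different random seeds, which is one possible reading of the description above; the helper names `cluster_runs` and `contingency` are hypothetical, and the contingency table is merely one simple way of inspecting how clusters relate to form labels.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture


def cluster_runs(X, n_runs=100, n_components=3, n_dims=3):
    """Fit a GMM n_runs times on PCA-reduced data; one row of cluster ids per run."""
    reduced = PCA(n_components=n_dims).fit_transform(np.asarray(X))
    runs = []
    for seed in range(n_runs):
        # Assumption: each iteration differs only in its random initialization.
        gmm = GaussianMixture(n_components=n_components, random_state=seed)
        runs.append(gmm.fit_predict(reduced))
    return np.array(runs)


def contingency(cluster_ids, labels, n_components=3):
    """Count how the items of each form label are spread over the clusters."""
    names = sorted(set(labels))
    table = np.zeros((len(names), n_components), dtype=int)
    for cluster, label in zip(cluster_ids, labels):
        table[names.index(label), cluster] += 1
    return names, table
```

A table in which each row concentrates in a different column would indicate that the components align with the form labels; rows spread evenly across columns would indicate no such correspondence.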
SVM supervised classification using the whole-piece histogram data (26-entry vectors) resulted in an accuracy of 74%. This result may be considered weak, as it corresponds to the naive baseline (74%) of classifying all pieces as sonata movements (based on the fact that this is the largest category). The same model attained an accuracy of 72% when PCA was performed to reduce dimensionality to 12 prior to training the SVM model.
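The naive baseline mentioned above follows directly from the label counts given for the labeled major-mode movements (76 sonata, 20 ABA, 7 variations). A short worked computation, with the hypothetical helper name `majority_baseline`:

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Label counts of the 103 labeled major-mode movements, as reported above.
labels = ["sonata"] * 76 + ["ABA"] * 20 + ["variations"] * 7
label, accuracy = majority_baseline(labels)
print(f"{label}: {accuracy:.1%}")  # sonata: 73.8%
```

Any classifier whose accuracy does not clearly exceed this value has learned little beyond the class proportions.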
Using sliding-window histograms with dimension reduction to 26 (as explained in Subsection 3.3.1) resulted in accuracy scores ranging from 83% to 88% across the nine configurations tested for different window sizes and overlap, with an average accuracy of 86%. Representing these results by mean and SD, we report 86% ± 1.82%.
When reducing dimensionality to 12, the accuracy scores were in the same range (83%–88%), with an average accuracy of 85%. Smaller window sizes appear to correlate with slightly higher accuracy scores, although this trend is not entirely stable across all parameter configurations; the size of the overlap between windows has no apparent effect on the score.
Regularization parameter values: Higher values for the regularization parameter C mean that during the training phase, correct classification of data points according to their actual labels is prioritized, even at the cost of a more complex separator. We repeated our classification experiment with several different C values. It thereby became apparent that, when using C values below the default value 1, classification based on sliding-window data did not fare any better than that based on whole-piece data (accuracy around 74%). However, for the default value 1, the sliding-window representation achieved noticeably better results than those of the whole-piece histogram data. For any C value above 2, the scores are essentially the same as those reported above, which were computed for the individual case of C = 4 (accuracy ranging from 83% to 88% for the sliding-window data).
This improved achievement for a relatively high C value is not surprising: as the dataset is rather small and inhomogeneous, we conjecture that the training set will be spread out unevenly, making the separation quite sensitive to individual outliers.
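A comparison of this kind across several C values may be sketched as follows, assuming scikit-learn. The helper `loo_accuracy_by_c`, the RBF kernel (scikit-learn's default, not confirmed as the kernel used in the study) and the random placeholder data are assumptions of the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC


def loo_accuracy_by_c(features, labels, c_values, n_dims=12):
    """Leave-One-Out accuracy of an SVM for each regularization value C."""
    # PCA never sees the labels, so it is applied once before training.
    reduced = PCA(n_components=n_dims).fit_transform(features)
    scores = {}
    for c in c_values:
        # Kernel left at scikit-learn's default (RBF): an assumption.
        model = SVC(C=c)
        fold_scores = cross_val_score(model, reduced, labels, cv=LeaveOneOut())
        scores[c] = fold_scores.mean()
    return scores


# Placeholder data (random, not real): 60 pieces in 3 loosely separated classes.
rng = np.random.default_rng(0)
demo_labels = np.repeat([0, 1, 2], 20)
demo_features = rng.normal(size=(60, 26)) + demo_labels[:, None]
demo_scores = loo_accuracy_by_c(demo_features, demo_labels, [0.25, 1, 4])
```

Because each Leave-One-Out fold tests a single piece, the mean of the fold scores equals the share of pieces classified correctly.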
A typical confusion matrix may look like the one in Table 2. In the particular case at hand, the algorithm seems to have had particular trouble classifying the ABA movements, as only 10 ABA movements out of 20 were classified correctly. On the other hand, out of 76 movements in sonata form, all but two were identified correctly at the test phase, and all variation movements were classified correctly.
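The per-class figures quoted above translate into recall values and an overall accuracy as sketched below. The counts of correct classifications come from the text; the total of 7 variation movements is inferred from the corpus size of 103 and is an assumption of the example. The distribution of the misclassified movements over the remaining classes is not needed for this calculation.

```python
# Correct classifications and class sizes. The sonata and ABA figures are
# stated in the text; 7 variation movements is inferred as 103 - 76 - 20.
class_totals = {"sonata": 76, "ABA": 20, "variations": 7}
correct = {"sonata": 74, "ABA": 10, "variations": 7}

# Recall per class: share of each form's movements that were identified.
recall = {form: correct[form] / class_totals[form] for form in class_totals}

# Overall accuracy: all correct classifications over all movements.
accuracy = sum(correct.values()) / sum(class_totals.values())

for form, value in recall.items():
    print(f"{form}: {value:.1%}")
print(f"overall: {accuracy:.1%}")
```

Under these counts, 91 of 103 movements are classified correctly (about 88%), which shows that the overall score is carried by the large sonata class while recall for ABA movements stays at 50%.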
Artificial Neural Networks, an inherently different classification model from SVMs, yield similar results. For the whole-piece histogram data, the network achieved an accuracy score of 66%, somewhat worse than that of our SVM classification experiment. For the nine configurations of sliding windows, its performance under reduction to 26 dimensions was comparable to that of the SVM model, with scores ranging from 83% to 88% and a mean and SD of 85% ± 1.63%. For dimensionality reduction to 12 dimensions, the network achieved similar scores: 66% for the whole-piece histogram data, and scores ranging from 80% to 90% for the sliding-window histograms, with mean and SD of 83.6% ± 2.87%. This result is reassuring, especially given the fundamental differences between the two models, and corroborates our conjecture that the sliding-window representation holds useful data for classification by form and that it surpasses the whole-piece data by far.
The clustering results were evaluated as follows: for each of the iterations, the three resulting clusters were compared with the actual three formal categories (sonata, ABA and variations), and the mapping which achieved the best match was selected. The GMM clustering for these data typically tended to converge to one of several possible options for each selection of parameters. Over the 100 iterations, the accuracy scores for the sliding-window histograms (58 windows of size 15%) averaged 58.9%, ranging from 44.7% to 64.1%. The whole-piece histograms fared even worse, with an average accuracy score of 50.5%, ranging from 41.7% to 63.1%. For both the whole-piece and the sliding-window data, this result falls well below the naive baseline (74%) of classifying all pieces as sonata-form movements.
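The best-match evaluation may be sketched as a search over all assignments of clusters to form categories. The sketch assumes a one-to-one mapping between clusters and categories, which the description above does not state explicitly; the function name and the toy labels are hypothetical.

```python
from itertools import permutations


def best_mapping_accuracy(cluster_ids, true_labels, categories):
    """Accuracy under the cluster-to-category mapping that matches best.

    cluster_ids holds integers 0..k-1, where k == len(categories).
    """
    best = 0.0
    for ordering in permutations(categories):
        # ordering[i] is the category assigned to cluster i in this mapping.
        hits = sum(ordering[c] == t for c, t in zip(cluster_ids, true_labels))
        best = max(best, hits / len(true_labels))
    return best


# Toy example (hypothetical labels): cluster 0 mostly matches "ABA", etc.
forms = ["sonata", "ABA", "variations"]
toy_clusters = [0, 0, 1, 1, 1, 2, 2, 0]
toy_truth = ["ABA", "ABA", "sonata", "sonata", "sonata",
             "variations", "variations", "sonata"]
toy_accuracy = best_mapping_accuracy(toy_clusters, toy_truth, forms)
```

With three categories there are only six candidate mappings, so an exhaustive search is sufficient.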
Thus, we cannot infer the existence of a connection between the unsupervised clustering results and the actual three form categories. Furthermore, music-theoretical examination of the clustering results fails to pinpoint any particular structural property which may correlate compellingly with this clustering.
Our results generally point to the viability of using pitch-class histograms to model musical form. At the same time, our classification experiments show sliding-window data to outperform the whole-piece histogram data by far, a result which agrees with our hypothesis formulated at the outset of this paper. Especially encouraging are the very similar results for both supervised models, indicating that the success in classifying by formal categories lies in the data representation, rather than the analytical model employed. We consider the near-identical performance of the SVM and the ANN models to indicate the solidity of our method and the correctness of our hypothesis, rendering the application of further supervised learning models to the same task non-essential for the present precursory study.
With regard to the unsupervised learning task, separation into three categories yielded poor results for both whole-piece and sliding-window data. This may be traced back to the possible presence of other, more dominant features in the data representation, which interfere with the algorithm’s ability to focus on form-related features. However, at the present stage of investigation we do not have a clear indication as to what structural features of the music may have played a role in the results of our GMM clustering experiment. We consider the unsupervised part of our investigation to be of an exploratory nature; given the inconclusive results, we do not consider it worthwhile to implement additional unsupervised models for the present task.
Given the success of both supervised learning algorithms in classifying pieces by form labels, one may expect to observe a certain natural separation into the three generic form categories in visual representations, too. Arguably, the PCA visualization of our dataset shown in Figure 3, representing a three-dimensional reduction of our sliding-window data, suggests that data points pertaining to the same form label are, on the whole, grouped reasonably well together.
A further interesting insight into our sliding-window data may be gleaned from averaging all pitch-class vectors for each of the three investigated form categories, and examining the behavior of each of the individual 26 pitch spellings over time. Figures 4, 5 and 6 show the relative weight of several selected pitch spellings across 90 windows of size 10%, averaged for all movements of the same form category. Examining Figure 6 (variation movements), it is hardly surprising to find out that the pitch classes C and G dominate throughout (because one never quits the main key for longer than a few measures), with E♭ peaking about halfway into the movement, that is, at the customary position of the minor-mode variation. In comparison, Figure 4 (sonata movements) shows a more dynamic behavior, with many lines crisscrossing one another. Especially interesting is the increased usage of F♯ in the first third of the piece, corresponding to the expositional modulation to the dominant key. Later on, shortly before the piece’s middle, that is, at the approximate onset position of the development section, the diagram is characterized by a sudden proliferation of pitch classes relatively distant with regard to the pitch collection of C major. As for Figure 5 (ABA movements), it may be noticed from the behavior of several individual pitch classes over time that there is an underlying symmetry between the first and the final third of the timeline representation, corresponding to the initial and the final A section of the ternary form.
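The averaging step behind such diagrams may be sketched as follows. The nested-list layout (pieces, then windows, then one weight per pitch spelling), the function name and the toy values are assumptions made for the example.

```python
def average_trajectory(pieces):
    """Average windowed pitch-spelling weights over a group of pieces.

    pieces: list of pieces, each a list of windows, each window a list
    of weights (one per pitch spelling). All pieces must share the same
    number of windows and spellings.
    Returns a list of windows holding the mean weight per spelling.
    """
    n_pieces = len(pieces)
    n_windows = len(pieces[0])
    n_spellings = len(pieces[0][0])
    return [
        [sum(piece[w][s] for piece in pieces) / n_pieces
         for s in range(n_spellings)]
        for w in range(n_windows)
    ]


# Toy input (hypothetical): two pieces, three windows, two spellings.
piece_a = [[0.8, 0.2], [0.5, 0.5], [0.8, 0.2]]
piece_b = [[0.6, 0.4], [0.3, 0.7], [0.6, 0.4]]
mean_by_window = average_trajectory([piece_a, piece_b])
```

Reading one spelling's value across consecutive windows of the result yields a single line of the kind plotted for each form category.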
Among the potential caveats of our methodology, the variety of piece lengths across the corpus has possibly problematic implications for the sliding-window representation. For instance, in the case of larger-scale movements, sliding-window data runs the risk of being insensitive to important local tonal events; with regard to shorter pieces, on the other hand, there is the risk of being overly sensitive to melodic and harmonic details that have no bearing on the piece’s tonal and formal trajectory. However, these side-effects are arguably limited, as the tonal and formal trajectory of both longer and shorter pieces is expected to scale with the piece’s length. Ultimately, this issue merits further consideration.
A further potential methodological drawback is that of the particularly small dataset, suggesting that classification results may be influenced by hidden variables. For example, if a particularly rare pitch spelling should be used incidentally only in conjunction with pieces in, say, ABA form, the supervised learning model would come to associate it with the ABA form, although the rare pitch spelling in question might just as well be connected with the piece’s absolute key, rather than its structure. Although this undesirable effect is theoretically possible, we expect it to be mitigated by the dimensionality reduction, which is blind to formal classification, as well as by the Leave-One-Out cross-validation method. We further consider it unlikely that the dataset should incorporate particularly dominant hidden variables, since the data, which spans many decades and stems from two quite dissimilar composers, exhibits significant variance within each of the form categories investigated.
Finally, while our SVM and ANN analyses achieved good results with regard to identifying sonata and variation movements, results on ABA movements were rather poor, with only about 50% of the movements in this form classified correctly (cf. Table 2). Notably, as can be gleaned from Figure 1, of the three generic forms labeled in our study, ABA is the only one whose layout involves no fixed subsidiary key. By contrast, sonata and variation movements of the repertoire under investigation entail characteristic subsidiary keys: the dominant and the parallel minor key, respectively. This means that ABA movements, while sharing a similar overall design, have little in common in terms of tonal trajectory, because the key of the middle section varies from case to case. This possibly explains why a supervised learning algorithm trained on pitch-class distributions may face difficulty grasping what makes a given succession of pitches qualify as an ABA movement, and will accordingly often fail to classify ABA movements correctly.
More specifically, only five of the twenty ABA movements of the corpus were classified correctly by both supervised learning algorithms (SVMs and ANNs) throughout the entire experiment. Three of these (Mozart’s K. 282/ii and K. 331/ii, and Beethoven’s no. 28/ii) have their middle (“B”) section in the subdominant of the main key. This shared interim key may have played a pivotal role in the way the algorithms came to model the ABA label. However, this explanation is only partly compelling. For one thing, there are another two ABA movements with middle sections in the subdominant key which were, nevertheless, often classified incorrectly—in particular, one of these, the second movement of Beethoven’s Piano Sonata no. 12, a scherzo in A♭ major whose trio (middle section) is in the subdominant key of D♭ major, was constantly mistaken for a sonata movement. What is more, two other ABA movements whose middle sections are in a different key than the subdominant were always classified correctly. Ultimately, while we have a general hunch about why our classification algorithms achieved particularly poor results with regard to the ABA form, the reasons why certain movements were classified correctly, while others were not, remain largely obscure.
To conclude, we are aware that our modeling of musical forms on pitch-class distributions may be only successful for a limited chapter in music history, a success that may be due to the fairly uniform tonal trajectories of pieces in the Classical style (until about 1830)—notwithstanding several deviations from the norm already found in some Beethoven movements as discussed above. Further investigation is required in order to assess whether histogram-based analysis may succeed in classifying later pieces of the Romantic period according to form labels, given that analyses of nineteenth-century and late-tonal pieces clearly show that formal layouts no longer correspond closely with a typified succession of local keys. We conjecture that trying to separate, say, Chopin sonata-form movements from his nocturnes based on pitch-class content will not yield remarkable results.
As noted above, the sensitivity of histogram-based analysis to a movement’s mode played a major role in our decision to exclude minor-mode pieces from our analysis altogether. A further development of our method would strive to incorporate pieces in the minor mode in form-oriented analyses as well. We proceed from the assumption that, given the implications of a piece’s mode on its tonal and formal trajectory, a machine learning algorithm attuned to handling both major- and minor-mode pieces would need a separate set of form labels for the minor mode. For instance, while Classical variation movements in major often feature one variation in the parallel minor, the situation with regard to variation movements in the minor is pretty much the same in reverse: such movements often incorporate a variation in the parallel major at some midway position. Arguably, under such conditions a supervised learning algorithm would fare better if trained on separate labels for major- and minor-mode variation movements.
Given that different formal layouts manifest themselves in different tonal trajectories, one may consider modeling musical form on the succession of local keys throughout a piece. We contemplate two possible ways of gleaning local-key information from a piece: (1) extracting the sequence of keys from the raw data (i.e., symbolic representations or audio data) algorithmically; (2) using expert annotations on local keys (e.g., as part of a more extensive harmonic analysis).
When opting to extract a piece’s sequence of local keys algorithmically, one would benefit from the improvements achieved by recent MIR studies on global and local key identification. For instance, Nápoles López et al. (2019) approach the task of key finding by applying a Hidden Markov Model to key profiles. Improvements have also been made in handling the minor mode in key-detection tasks, as by Albrecht and Shanahan (2013).
In order to model formal processes, it may be important to estimate not only individual keys, but also local key changes. For instance, Feisthauer et al. (2020) base their modeling of key shifts on music-theoretical knowledge of proximity among different keys and standard key-establishing progressions. Tested on a selection of Classical string quartets, their model achieved over 80% success in predicting a piece’s modulation plan.
Note, however, that only some of the modulatory processes taking place throughout a given piece represent a corollary of the formal trajectory. Many music-theoretical sources distinguish between two types of modulatory processes: “tonicizations” vs. “modulations” (Aldwell et al., 2018). Whereas modulations shape the piece’s strategic tonal plan, tonicizations represent incidental, quasi-momentary key changes. A piece’s modulation plan is, by and large, characteristic of its underlying formal design. On the other hand, local tonicizations are as a rule the property of an individual movement, and as such are less indicative of the formal trajectory, which is, by definition, common to many pieces. Nápoles López et al. (2020) develop a model which succeeds in distinguishing between the two categories and in predicting their occurrences. However, their method is not specifically directed towards discarding (local) tonicizations while retaining only the more substantial modulations. Further work on the separation between tonicizations and modulations has been carried out by Micchi et al. (2020) and Chen and Su (2021). Ultimately, besides crucial information on form-related modulations, an individual piece’s sequence of local keys also entails a great deal of “noise,” that is, local tonicizations and other chromatic pitches that obscure the underlying tonal trajectory.
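To illustrate what discarding tonicizations while retaining modulations could amount to in practice, the following sketch filters a per-measure sequence of local keys by duration. The duration threshold is a deliberately crude, hypothetical heuristic introduced for this example; it is not the criterion used in any of the studies cited above, and the key labels and function name are likewise illustrative.

```python
from itertools import groupby


def modulation_plan(local_keys, min_length=8):
    """Collapse a per-measure key sequence, dropping short-lived keys.

    Runs shorter than min_length measures are treated as tonicizations
    and discarded; neighbouring runs in the same key are then merged.
    The threshold is a hypothetical heuristic, not a published criterion.
    """
    runs = [(key, len(list(group))) for key, group in groupby(local_keys)]
    kept = [key for key, length in runs if length >= min_length]
    # Merge repeats left adjacent after a short run was removed.
    return [key for key, _ in groupby(kept)]


# Toy sequence (hypothetical): C major with a two-measure turn to D minor,
# a sustained move to G major, and a return to C major.
toy_keys = ["C"] * 10 + ["d"] * 2 + ["C"] * 8 + ["G"] * 16 + ["C"] * 12
toy_plan = modulation_plan(toy_keys)
```

In the toy sequence the brief D minor passage is dropped and the plan reduces to tonic, dominant, tonic. A fixed threshold of this kind would misjudge short pieces and long tonicizations alike, which is one reason why the distinction calls for the more elaborate models discussed above.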
Another way to obtain a piece’s sequence of local keys is by expert annotations. Ground-truth data on local keys is now obtainable from several recent data projects which offer meticulous harmonic analyses linked with the note-level representations via timestamps. In fact, full harmonic analyses have been provided for most of the work corpus addressed in our study: Hentschel et al. (2021) offer a harmonic analysis of Mozart’s complete piano sonatas, while Chen and Su (2018) analyze the first movements of Beethoven’s piano sonatas. Seemingly, our experiment may be repeated based on local-key information derived from these annotations. However, proceeding in this direction would require unifying the very different annotation standards used in these two corpora. For instance, while Hentschel et al. (2021) make a clear distinction between local tonicizations and more “strategic,” form-related modulations, no such distinction is proposed by Chen and Su (2018).
We finally propose that pitch-class histogram methods, and, in particular, sliding-window histograms, may be applicable to a variety of additional MIR tasks. In previous work, various computational methods were utilized to model a composer’s personal style, even to a degree that allows inferences on authenticity, as in van Kranenburg (2006). Our PCA analysis of the 103 data points derived from the labeled major-mode movements of our corpus, presented in Figure 7, suggests that at least a certain degree of separation by composer may be feasible. Further work will be needed to explore the viability of sliding-window histograms for composer assignment and other machine learning tasks.
1. The files of Mozart’s piano sonatas were derived from MuseScore files edited by the DCML team at the École Polytechnique Fédérale de Lausanne. These files are available at https://github.com/DCMLab/mozart_piano_sonatas.
2. The files of Beethoven’s piano sonatas were uploaded in MuseScore format by user ClassicMan to https://musescore.com/user/19710/sets/54311.
5. Available at https://web.mit.edu/music21/.
6. Available at https://scikit-learn.org.
This project is supported by The Israel Science Foundation (grant No. 480/17) and by Tel Aviv University. We thank the editor and the anonymous reviewers for their insightful comments. We further thank Dr. Moni Shahar of the Tel Aviv University Center for AI and Data Science (TAD) for his help in carrying out the statistical analysis, as well as Johannes Hentschel, Dr. Robert Lieck, and Dr. Fabian C. Moss of the École Polytechnique Fédérale de Lausanne Digital and Cognitive Musicology Lab, and Dr. Markus Neuwirth of the Anton Bruckner University, Linz, for sharing with us their expertise and suggestions.
The authors have no competing interests to declare.
Albrecht, J., and Shanahan, D. (2013). The use of large corpora to train a new type of key-finding algorithm: An improved treatment of the minor mode. Music Perception: An Interdisciplinary Journal, 31(1): 59–67. DOI: https://doi.org/10.1525/mp.2013.31.1.59
Allegraud, P., Bigo, L., Feisthauer, L., Giraud, M., Groult, R., Leguy, E., and Levé, F. (2019). Learning sonata form structure on Mozart’s string quartets. Transactions of the International Society for Music Information Retrieval (TISMIR), 2(1): 82–96. DOI: https://doi.org/10.5334/tismir.27
Bigo, L., Giraud, M., Groult, R., Guiomard-Kagan, N., and Levé, F. (2017). Sketching sonata form structure in selected classical string quartets. In International Society for Music Information Retrieval Conference.
Briot, J.-P., Hadjeres, G., and Pachet, F.-D. (2020). Deep Learning Techniques for Music Generation. Springer. DOI: https://doi.org/10.1007/978-3-319-70163-9
Chen, T.-P., and Su, L. (2018). Functional harmony recognition of symbolic music data with multi-task recurrent neural networks. In International Society for Music Information Retrieval Conference, pages 90–97.
Chen, T.-P., and Su, L. (2021). Attend to chords: Improving harmonic analysis of symbolic music using transformer-based models. Transactions of the International Society for Music Information Retrieval, 4(1): 1–13. DOI: https://doi.org/10.5334/tismir.65
Fucks, W. (1962). Mathematical analysis of formal structure of music. IRE Transactions on Information Theory, 8(5): 225–228. DOI: https://doi.org/10.1109/TIT.1962.1057746
Gedik, A. C., and Bozkurt, B. (2010). Pitch-frequency histogram-based music information retrieval for Turkish music. Signal Processing, 90(4): 1049–1063. DOI: https://doi.org/10.1016/j.sigpro.2009.06.017
Hentschel, J., Neuwirth, M., and Rohrmeier, M. (2021). The Annotated Mozart Sonatas: Score, harmony, and cadence. Transactions of the International Society for Music Information Retrieval, 4(1): 67–80. DOI: https://doi.org/10.5334/tismir.63
Hepokoski, J., and Darcy, W. (2006). Elements of sonata theory: Norms, types, and deformations in the late-eighteenth-century sonata. Oxford University Press. DOI: https://doi.org/10.1093/acprof:oso/9780195146400.001.0001
Jiang, D.-N., Lu, L., Zhang, H.-J., Tao, J.-H., and Cai, L.-H. (2002). Music type classification by spectral contrast feature. In Proceedings of the IEEE International Conference on Multimedia and Expo, volume 1, pages 113–116. IEEE. DOI: https://doi.org/10.1109/ICME.2002.1035731
Koduri, G. K., Gulati, S., Rao, P., and Serra, X. (2012). Rāga recognition based on pitch distribution methods. Journal of New Music Research, 41(4): 337–350. DOI: https://doi.org/10.1080/09298215.2012.735246
Krumhansl, C. L., and Kessler, E. J. (1982). Tracing the dynamic changes in perceived tonal organization in a spatial representation of musical keys. Psychological Review, 89(4): 334–368. DOI: https://doi.org/10.1037/0033-295X.89.4.334
Lidy, T., and Rauber, A. (2005). Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In International Conference on Music Information Retrieval (ISMIR), pages 34–41.
Lidy, T., Rauber, A., Pertusa, A., and Quereda, J. M. I. (2007). Improving genre classification by combination of audio and symbolic descriptors using a transcription system. In International Conference on Music Information Retrieval (ISMIR), pages 61–66.
Micchi, G., Gotham, M., and Giraud, M. (2020). Not all roads lead to Rome: Pitch representation and model architecture for automatic harmonic analysis. Transactions of the International Society for Music Information Retrieval (TISMIR), 3(1): 42–54. DOI: https://doi.org/10.5334/tismir.45
Nápoles López, N., Arthur, C., and Fujinaga, I. (2019). Key-finding based on a hidden Markov model and key profiles. In 6th International Conference on Digital Libraries for Musicology, pages 33–37. DOI: https://doi.org/10.1145/3358664.3358675
Nápoles López, N., Feisthauer, L., Levé, F., and Fujinaga, I. (2020). On local keys, modulations, and tonicizations: A dataset and methodology for evaluating changes of key. In 7th International Conference on Digital Libraries for Musicology, pages 18–26. DOI: https://doi.org/10.1145/3424911.3425515
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12: 2825–2830.
Quinn, I. (2010). Are pitch-class profiles really “key for key”? Zeitschrift der Gesellschaft für Musiktheorie [Journal of the German-speaking Society of Music Theory], 7(2): 151–163. DOI: https://doi.org/10.31751/513
Quinn, I., and White, C. W. (2017). Corpus-derived key profiles are not transpositionally equivalent. Music Perception: An Interdisciplinary Journal, 34(5): 531–540. DOI: https://doi.org/10.1525/mp.2017.34.5.531
Reynolds, D. A. (2009). Gaussian mixture models. Encyclopedia of Biometrics, 741: 659–663. DOI: https://doi.org/10.1007/978-0-387-73003-5_196
Sapp, C. S. (2005). Visual hierarchical key analysis. Computers in Entertainment (CIE), 3(4): 1–19. DOI: https://doi.org/10.1145/1095534.1095544
Schreiber, H., Weiss, C., and Müller, M. (2020). Local key estimation in classical music recordings: A cross-version study on Schubert’s Winterreise. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 501–505. IEEE. DOI: https://doi.org/10.1109/ICASSP40776.2020.9054642
Sears, D., Caplin, W. E., and McAdams, S. (2012). Perceiving the classical cadence. Music Perception: An Interdisciplinary Journal, 31(5): 397–417. DOI: https://doi.org/10.1525/mp.2014.31.5.397
Shmulevich, I., and Yli-Harja, O. (2000). Localized key finding: Algorithms and applications. Music Perception, 17(4): 531–544. DOI: https://doi.org/10.2307/40285832
Smith, A. (1973). Feasibility of tracking musical form as a cognitive listening objective. Journal of Research in Music Education, 21(3): 200–213. DOI: https://doi.org/10.2307/3345090
Temperley, D. (1999). What’s key for key? The Krumhansl-Schmuckler key-finding algorithm reconsidered. Music Perception, 17(1): 65–100. DOI: https://doi.org/10.2307/40285812
Tzanetakis, G., and Cook, P. (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5): 293–302. DOI: https://doi.org/10.1109/TSA.2002.800560
Tzanetakis, G., Ermolinskyi, A., and Cook, P. (2003). Pitch histograms in audio and symbolic music information retrieval. Journal of New Music Research, 32(2): 143–152. DOI: https://doi.org/10.1076/jnmr.32.2.143.16743
Weiß, C., Klauk, S., Gotham, M., Müller, M., and Kleinertz, R. (2020). Discourse not dualism: An interdisciplinary dialogue on sonata form in Beethoven’s early piano sonatas. In International Society for Music Information Retrieval Conference, pages 199–206.
Zhu, Y., and Kankanhalli, M. S. (2006). Precise pitch profile feature extraction from musical audio for key detection. IEEE Transactions on Multimedia, 8(3): 575–584. DOI: https://doi.org/10.1109/TMM.2006.870727