Pitch-class distributions (PCDs) are fundamental objects of study in many fields of empirical music research, from music information retrieval (MIR), mathematical music theory, and computational musicology to music perception, cognition, and corpus studies (Huron and Veltman, 2006). They are closely related to relevant cognitive and musical properties of the pieces, such as their key, degree of consonance and dissonance, and characteristics of the modulation plan. A PCD may thus be understood as a fingerprint of a piece that contains important information about its tonality.
However, despite the meaningful structure of PCDs, many existing approaches do not make the relevant aspects explicit in the way PCDs are modeled and represented. In fact, PCDs are mostly treated as fixed templates, whose structure is only interpreted in a post-hoc manner. This is also reflected in the way PCDs are commonly visualized, namely as categorical distributions of twelve chromatic semitones, where visual proximity does not necessarily reflect musical relations. Ignoring music-theoretical insights about the structure of tonal space for representing and visualizing PCDs conceals deeper structural relations that are reflected in these distributions.
The goal of this paper therefore is to provide a compact, structured and interpretable representation of PCDs. The proposed Tonal Diffusion Model (TDM) achieves this in two ways: first, by leveraging relevant music-theoretic and cognitive insights, specifically, the algebraic representation of tonal space called the Tonnetz (Hostinský, 1879; Cohn, 1997); and second, by defining an explicit statistical model, which independently generates all tones in a piece by starting at a piece-specific tonal origin and taking steps in different directions on the Tonnetz.
We quantitatively evaluate our model against several baseline models on a corpus of 248 pieces from the Baroque, Classical, and Romantic eras, demonstrating that the TDM provides music-theoretically reasonable interpretations. Moreover, it models the empirical PCDs more accurately than the unstructured baseline models, as measured by the Kullback-Leibler divergence (KLD) to the empirical PCD.
Many large corpora are available in the MIDI format, which encodes pitch through an integer representation, commonly interpreted as twelve-tone equal temperament (12TET) (Selfridge-Field, 1997). MIDI thus has the major drawback that it does not preserve the musically relevant distinctions between enharmonically equivalent pitch classes that are expressed in pitch spelling. This simplified representation has largely carried over to PCDs, which are commonly visualized in a linear chromatic arrangement.
This implies three shortcomings (see Appendix A for a more detailed discussion): 1) the visual representation in a linear arrangement suggests some implicit ordering and proximity relation that is not explicitly modeled; 2) the commonly used chromatically ascending order does not reflect tonal relations well, especially with respect to harmony and key, where an arrangement along the line or circle of fifths would be better suited; and 3) the space of pitch classes exhibits an inherent cyclic topology, which is not reflected in a linear arrangement. The TDM overcomes these limitations by using the Tonnetz as the basis representation for PCDs, which preserves pitch spelling, and by representing proximity relations via steps on the Tonnetz. Moreover, the Tonnetz extends the purely fifth-based representation by additionally allowing for proximity relations based on major and minor third steps.
The pervasive categorical representation of PCDs is also used to display the results of psychological probe-tone experiments where participants rate different degrees of stability of tones in 12TET (Krumhansl and Kessler, 1982). The underlying assumption in these experiments is that enculturated listeners have acquired mental representations of tonality through exposure to music via statistical learning, which approximately corresponds to inferring the note distributions in musical corpora (Krumhansl, 1990; Huron and Veltman, 2006).
Several studies have shown that PCDs suggest an arrangement that is essentially equivalent to the Tonnetz (see Section 3.2). Krumhansl and Kessler (1982) use multidimensional scaling (MDS) to find a toroidal structure of the 24 key profiles, obtained by transposing the major and minor profiles to all twelve keys. Modeling tonal space as a torus has a long-standing tradition in music theory and is supported by these empirical findings.
The cognitive relevance of this structure has been further investigated by Krumhansl (1998), where the Tonnetz is used to model the perceived distance of triads. Toiviainen and Krumhansl (2003) use a self-organizing map, which essentially reproduces the results of the MDS. Their study extends prior research by modeling listeners’ dynamic perception of tonality in a musical piece over time, which is reflected in high activation of certain regions on the Tonnetz.
Milne and Holland (2016) compare three different models for perceived triadic distance, namely the psychoacoustic measure of spectral pitch-class distance, distance on the Tonnetz, and voice-leading distance between the triads. They conclude that the Tonnetz is highly predictive of empirical data on triadic similarity and correlates strongly with the psychoacoustic measure.
Our approach builds upon these prior works on the cognitive relevance of the Tonnetz by using it more generally to represent empirical PCDs.
The basic structure of the TDM (see Section 4) is similar to that of probabilistic topic models (Blei et al., 2003; Steyvers and Griffiths, 2007), which have successfully been employed for the prediction of word frequencies in natural language. Topic models represent a corpus as a bag (an unordered collection) of documents (e.g. texts or pieces) and each document as a bag of items (e.g. words or notes). However, language models do not incorporate the kind of structure present in music, which can be expressed in algebraic (Fiore and Noll, 2011) and geometric (Tymoczko, 2011) models. Most topic models for language assume no relation between words and topics besides their probability of co-occurrence, even though there are some attempts to overcome this limitation (Griffiths et al., 2005).
A topic model for music should therefore go beyond those developed for natural language by explicitly modeling the structural relations of pitch classes in tonal music and incorporating the geometric and algebraic structures known from music theory as well as insights from music cognition.
The only music-specific application of topic models we are aware of is by Hu and Saul (2009a, b). Although their model shares the general topic model structure with our TDM, there are two profound differences (see Appendix B for a more detailed discussion): 1) Hu and Saul address the problem of key finding and learn two static key profiles in twelve enharmonically equivalent pitch classes, whereas our model aims to provide a more fine-grained representation of tonality that goes beyond a binary major/minor classification, also taking into account pitch spelling. 2) Hu and Saul infer separate key labels for different sections of each piece, while our goal is to compactly represent the global harmonic character of the entire piece.
Consequently, while Hu and Saul evaluate their model by comparing its classification accuracy to that of established template-based key finders, we use the KLD to the empirical distribution as a performance measure, which makes a direct comparison of the present approach with that of Hu and Saul difficult. Note, however, that our baseline model with two static key profiles is conceptually very similar to the model by Hu and Saul.
Based on the music-theoretical considerations discussed above, we present a cognitive interpretation of the Tonnetz and describe a model of the latent cognitive process, grounded in the perception of tonal music.
We represent tones on the line of fifths (Temperley, 2000). That is, the tonal space ℑ is the set of tones generated by taking an arbitrary number of steps from a given reference tone by either descending or ascending a perfect fifth (see Figure 1). Assuming octave equivalence, tonal pitch classes (TPCs), used for labeling notes in Western music, have a one-to-one mapping to the line of fifths (second row in Figure 1). The set of intervals ℐ, describing the distance between two tones, is the set of directed TPC intervals (third and fourth row in Figure 1). We use ‘+’ and ‘–’ to indicate the direction (ascending or descending), describe the generic intervals (Clough and Myerson, 1985; Harasim et al., 2016) with Arabic numerals (1: unison, 2: second, 3: third, 4: fourth, 5: fifth etc.), and denote the quality of the intervals (diminished, minor, perfect, major, and augmented) with ‘d’, ‘m’, ‘P’, ‘M’, ‘A’, respectively.
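As a concrete illustration of this one-to-one mapping, the following minimal Python sketch converts TPC names to integer positions on the line of fifths and back. The choice of C as the reference tone at position 0, the ASCII accidentals `#` and `b`, and the function names are assumptions made for this illustration; they are not part of the original formulation.

```python
# Natural note names in line-of-fifths order; C is the (assumed) reference at 0.
NATURALS = ["F", "C", "G", "D", "A", "E", "B"]  # positions -1 .. 5


def tpc_to_fifths(name: str) -> int:
    """Map a TPC name such as 'C', 'F#', or 'Bbb' to its line-of-fifths position."""
    letter, accidentals = name[0], name[1:]
    position = NATURALS.index(letter) - 1
    # Each sharp moves seven fifths up, each flat seven fifths down.
    return position + 7 * accidentals.count("#") - 7 * accidentals.count("b")


def fifths_to_tpc(position: int) -> str:
    """Inverse mapping from a line-of-fifths position to a TPC name."""
    shifted = position + 1  # so that F corresponds to 0
    letter = NATURALS[shifted % 7]
    n_accidentals = shifted // 7  # floor division also handles flats
    if n_accidentals >= 0:
        return letter + "#" * n_accidentals
    return letter + "b" * (-n_accidentals)
```

Under these assumptions, `tpc_to_fifths("D##") - tpc_to_fifths("Fb")` evaluates to 24, the number of fifths between F♭ and D♯♯.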
Note that in this representation, we are able to distinguish enharmonically equivalent intervals, such as +A4, the ascending augmented fourth (tritone), and +d5, the ascending diminished fifth. This is a musically and cognitively relevant distinction: since the interval +d5 commonly resolves inwards into a third (+M3 or +m3) and +A4 (the tritone) commonly resolves outwards into a sixth (+m6 or +M6), these intervals create different harmonic expectations that imply different cognitive representations. As discussed in Section 2.1, a major drawback of the commonly used MIDI format is that it cannot be used to represent these musically important distinctions.
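The distinction can be made concrete with a small Python sketch. It encodes directed TPC intervals as signed step counts on the line of fifths (an assumed encoding in which +P5 corresponds to +1) and projects them to 12TET semitones, where the spelling information is lost. All names are illustrative only.

```python
# Directed TPC intervals as signed numbers of fifths (assumed encoding: +P5 = +1).
INTERVAL_IN_FIFTHS = {"+P5": 1, "+M3": 4, "+m3": -3, "+A4": 6, "+d5": -6}


def to_semitones(fifths: int) -> int:
    """Project a line-of-fifths interval to a 12TET pitch-class interval."""
    # A perfect fifth spans 7 semitones; octave equivalence gives mod 12.
    return (7 * fifths) % 12


augmented_fourth = INTERVAL_IN_FIFTHS["+A4"]
diminished_fifth = INTERVAL_IN_FIFTHS["+d5"]

# Distinct on the line of fifths, but identical after projection to 12TET.
distinct_as_tpc_intervals = augmented_fourth != diminished_fifth
same_in_12tet = to_semitones(augmented_fourth) == to_semitones(diminished_fifth)
```

Both flags are true: the two intervals lie twelve fifths apart on the line of fifths but coincide at six semitones in 12TET.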
Our goal in this paper is to provide a statistical model for music that allows us to describe tonal pitch-class distributions of musical pieces in a compact, structured and interpretable way. In the following, the term pitch class always refers to TPCs and the term interval refers to directed TPC intervals.
While the line of fifths connects all possible pitch classes and is of central importance in tonal music (Gárdonyi and Nordhoff, 2002; Temperley, 2000; Weber, 1851), there are other ways of relating the same two tones via one or more interval steps. Of particular importance for harmonic relations in tonal music are, apart from the perfect fifth (±P5), also the major third (±M3) and the minor third (±m3), both ascending and descending (Aldwell et al., 2010; Haas, 2004). We call these the primary intervals and for our evaluation of the TDM we will restrict the set of intervals ℐ to the primary intervals. Figure 2 shows these primary interval relations with respect to the central tone C.
The infinite expansion of this graph is called the Tonnetz (Bigo and Andreatta, 2016; Cohn, 1997, 2012; Harasim et al., 2019) and has a number of historic precursors (e.g., Euler, 1739; Hostinský, 1879; von Oettingen, 1866; Riemann, 1896). Using the representation as a hexagonal graph (Douthett and Steinbach, 1998), PCDs can be plotted as color-coded heatmaps (Moss, 2019). Figure 3 shows the distributions of three pieces from three different musical epochs (by Bach, Beethoven, and Liszt), which we also use for our evaluation below.
In the case of tonal pitch classes, the Tonnetz is topologically equivalent to the Spiral Array (Chew, 2000). In the two-dimensional visualizations in Figure 3, where this structure is ‘unrolled’, the same tonal pitch class may appear multiple times and is then colored identically.
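To illustrate why the same tonal pitch class appears at several positions of the unrolled Tonnetz, the following Python sketch uses a hypothetical coordinate system with one axis for perfect fifths and one for major thirds; the minor third axis then follows as their difference. The coordinate convention and the function name are assumptions made for illustration.

```python
def tonnetz_to_fifths(fifth_steps: int, major_third_steps: int) -> int:
    """Line-of-fifths position of a Tonnetz node relative to the central tone.

    A perfect fifth is one step and a major third four steps on the line of
    fifths, so different Tonnetz nodes can map to the same TPC.
    """
    return fifth_steps + 4 * major_third_steps


# E relative to C: reached by four fifths or by a single major third.
via_fifths = tonnetz_to_fifths(4, 0)
via_third = tonnetz_to_fifths(0, 1)

# An ascending minor third is a fifth up and a major third down (C -> G -> Eb).
minor_third = tonnetz_to_fifths(1, -1)
```

Here `via_fifths` and `via_third` are equal, so the two nodes carry the same TPC and would be colored identically in the heatmap.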
When looking at the distributions in Figure 3 it becomes obvious that the different pieces exhibit characteristic structures along the different axes of the Tonnetz. Bach’s piece is largely distributed along the line of fifths (horizontally) with its tonic C occurring most frequently. Beethoven’s Sonata movement is also distributed along the line of fifths but it exhibits a considerably wider spread that is roughly symmetric about the most frequent tone A (the tonic of the dominant key). Furthermore, the pitch classes of the tonic D-minor and the dominant A-major triads occupy most probability mass, resulting in the major third axis (F–A–C♯) being more pronounced than in Bach’s piece. Finally, the pitch classes in Liszt’s piece are most widely distributed, covering a range from F♭ to D♯♯ (24 fifths). In contrast to the other two pieces, the particularly wide distribution of pitch classes in this piece reflects the fact that it juxtaposes harmonically distant keys (in particular through mediantic relations), such as F♯, D, and B♭. These key relations are not reflected in the line-of-fifths topology but are explicitly modeled by the major and minor third axes of the Tonnetz, which suggests the Tonnetz to be a more suitable representation for this kind of extended tonality. We will discuss the tonal structures of these three pieces in more detail in Section 5.3.
A central idea of the TDM is that the generation of tones is associated with certain motions on the Tonnetz. Specifically, the Tonnetz captures harmonic relations, so that moving on the Tonnetz corresponds to typical harmonic changes occurring in a piece. These changes happen on several time scales, from large-scale modulations between different sections of a piece, to harmonic progressions, and polyphony on the surface level. Motion on the Tonnetz thus reflects the deep structural dynamics of a piece.
Different paths through the Tonnetz express different harmonic interpretations of the involved tones. This distinction of harmonic functions expressed in different paths on the Tonnetz is an explicit part of our model.
For instance, the tone E can be reached from the tone C by ascending four perfect fifths or by ascending a major third. In the first case, E is conceived to be the perfect fifth of A, which is the perfect fifth of D, which is the perfect fifth of G, which, ultimately, is the perfect fifth of C, reflecting a recursive application of applied dominants, while in the second case, the tone E stands in a direct major third relation to C. We take this distinction as expressing two different interpretations of the harmonic function of E relative to C. That is, we assume that it corresponds to a difference on the cognitive level, because it relates the tones C and E in two different ways.
In some contexts, this might be reflected in different tunings and ways of hearing, as ascending four perfect fifths leads to the Pythagorean major third E, while the E reached by ascending a major third corresponds to the major third E in just intonation. However, since these different paths from C to E are indistinguishable in a notated score, our cognitive model does not require them to be physically distinct.
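The size of this tuning difference can be worked out with exact fractions. The sketch below assumes pure intervals with frequency ratios 3:2 for the perfect fifth and 5:4 for the just major third, and folds the stacked fifths back by two octaves; the variable names are illustrative.

```python
from fractions import Fraction

fifth = Fraction(3, 2)   # pure perfect fifth
octave = Fraction(2, 1)

# Four pure fifths up from C, folded back by two octaves into the octave above C.
pythagorean_third = fifth**4 / octave**2   # 81/64
just_third = Fraction(5, 4)                # pure major third

# The two versions of E differ by the syntonic comma.
syntonic_comma = pythagorean_third / just_third  # 81/80
```

The Pythagorean E is thus higher than the just E by a factor of 81/80, a small but audible difference that a notated score does not record.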
In the TDM, we assume that different paths have different probabilities of occurrence, that is, that some paths are more likely than others. For instance, we assume that different steps occur with different probabilities and that paths cannot have an arbitrary length. The step probabilities and the path-length distribution are explicit components of the TDM (Section 4). Altogether, these aspects reflect different ways of hearing tonal relations, as well as different stylistic characteristics of tonal music in general and of concrete pieces more specifically.
Tonal music is fundamentally characterized by the existence of one or more tonal centers that are related to each other in a hierarchical manner, for instance, by modulating through different local keys (Schenker, 1935; Schoenberg, 1969; Lerdahl and Jackendoff, 1983; Rohrmeier, 2011; Koelsch et al., 2013; Rohrmeier, 2020). In the TDM, we therefore introduce the concept of the tonal origin and assume that a unique tonal origin exists for each piece (see Section 4.2). Intuitively, the tonal origin corresponds to the single pitch class that is best suited to explain all tones in the piece by starting at this point on the Tonnetz and taking a small number of primary interval steps. There are two important caveats to keep in mind for interpreting the tonal origin: 1) Assuming a single tonal origin does not mean that the piece cannot modulate between different keys having their own respective local centers. 2) The tonal origin in our model is not necessarily equivalent to the tonic of the global key. Please see Appendix C for a more detailed discussion of these two points.
The purpose of the TDM is to model the intuitions presented in Section 3 formally and derive quantitative measures that capture the diffusion of probability mass from the tonal origin along the different axes of the Tonnetz. To this end, we define an explicit generative model, in which each tone in a piece is generated by starting at the tonal origin and taking a number of steps on the Tonnetz. As described in Section 3.3, there are many possible paths connecting two tones on the Tonnetz, which correspond to different cognitive interpretations of their harmonic relation. In our model, these different derivations are treated as a latent representation, which is marginalized out to determine the overall probability of reaching a particular tone.
In the general formulation of our model, we include the possibility of multiple tonal origins, and allow for an arbitrary set of intervals, as well as a generic path-length distribution. For the evaluation of the model on tonal music, presented in Section 5, we then make the more specific assumptions motivated in Section 3. Specifically, we assume a single tonal origin (Section 3.4) and restrict the allowed interval steps to the set of primary intervals present in the Tonnetz (Section 3.2).
The TDM has the basic structure of a topic model with two nested levels of generation, as shown in Figure 4a. On the inner level, each piece (i.e. each document) d is represented as a ‘bag of tones’ and all tones t ∈ d are generated independently, conditional on the piece-level variables β. On the outer level, the corpus D is represented as a ‘bag of pieces’ and all pieces d ∈ D are generated independently, conditional on the corpus-level parameters α. The tones t of a piece are the observed variables while γ represents any latent variables involved in the generation of a single tone t.
We extend this basic structure by splitting the corpus-level and piece-level variables into multiple distinct variables with a clear semantics and by replacing the inner generative step for a single tone with a model of the underlying latent cognitive process. The complete model is shown in Figure 4b and will now be explained in detail.
Let ℑ ≡ ℤ be the space of all possible TPCs and ℐ = {t–t′|t,t′∈ ℑ} ≡ ℤ be the space of all TPC intervals (see also Figure 1). We assume the following generative process, which corresponds to the graphical model in Figure 4b.
For each piece d in a corpus D, the piece-level variables are drawn as follows: the distribution of tonal origins c is drawn from a Dirichlet process with base distribution H_{c} over the tonal space ℑ and concentration parameter α_{c}; and the interval weights w are drawn from a Dirichlet process with base distribution H_{w} over the interval space ℐ and concentration parameter α_{w}. Furthermore, the path-length distribution λ is drawn from a prior with hyper-parameters h_{λ}, where the prior depends on the specific path-length distribution being used: for a Poisson distribution the conjugate prior is a gamma distribution, for a binomial distribution it is a beta distribution.
For each tone t in a piece d, a number of steps n is drawn from the path-length distribution (Poisson or binomial); a series of latent tones τ^{0},…,τ^{n} is generated by first drawing τ^{0} from the distribution of tonal origins and then transitioning n times from τ^{i} to τ^{i+1} by adding an interval drawn according to the interval weights w; finally, the last tone τ^{n} of the sequence is observed as t.
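The tone-level part of this generative process can be sketched as a small sampler in Python. The sketch makes several simplifying assumptions that are not taken from the original implementation: a single fixed tonal origin instead of a distribution c, a binomial path length with an integer number of trials, the six primary intervals encoded as signed steps on the line of fifths (+P5 = +1), and hypothetical function and parameter names.

```python
import random

# Assumed encoding of the six primary intervals as line-of-fifths steps.
PRIMARY_STEPS = {"+P5": 1, "-P5": -1, "+M3": 4, "-M3": -4, "+m3": -3, "-m3": 3}


def sample_tone(origin, weights, n_trials, p_step, rng):
    """Sample one tone: draw a path length, then walk from the tonal origin."""
    # Binomial path length n: number of successes in n_trials Bernoulli draws.
    n = sum(rng.random() < p_step for _ in range(n_trials))
    names = list(weights)
    tone = origin  # tau^0 (single, fixed tonal origin)
    for _ in range(n):
        # Choose the next interval according to the interval weights w.
        step = rng.choices(names, weights=[weights[k] for k in names])[0]
        tone += PRIMARY_STEPS[step]  # tau^{i+1} = tau^i + interval
    return tone  # tau^n, observed as t


rng = random.Random(0)
fifths_only = {"+P5": 0.5, "-P5": 0.5}
tones = [sample_tone(0, fifths_only, 4, 0.5, rng) for _ in range(1000)]
```

A histogram of `tones` approximates the PCD that the model assigns to a piece with these (made-up) parameter values; the intermediate tones of each walk are discarded, which corresponds to treating them as latent.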
For a given corpus D of musical pieces and corpus-level parameters α ≡ (H_{c}, α_{c}, H_{w}, α_{w}, h_{λ}), we would like to compute the maximum a posteriori (MAP) estimate β* of the piece-level variables β ≡ (c, w, λ) for each piece d. Following Bayes’ theorem, the posterior of β is proportional to the product of the prior p(β;α) over piece-level variables and the marginal likelihoods p(t|β) ≡ p(t|c,w,λ) of the single tones t in the piece d, with the latent variables γ ≡ (τ^{0},…,τ^{n},n) being marginalized out. Since in our model all tones within a piece are drawn independently and identically distributed, the distribution p(t|c,w,λ) needs to be computed only once per piece.
The expansion of the marginal likelihood p(t|c,w,λ) is shown in Equation (10), where p(n|λ) is the distribution over path lengths, p(τ^{i}|τ^{i–1},w) are the transition probabilities in the latent cognitive process, and p(τ^{0}|c) is the initial distribution corresponding to possible tonal origins in the piece. To compute p(t|c,w,λ), we have to explicitly marginalize out the latent variables γ ≡ (τ^{0},…,τ^{n},n).
The marginal likelihood p(t|c,w,λ) has a recursive structure, which becomes apparent in Equation (10) by highlighting the intermediate terms p(τ^{i}|c,w), where the subset of latent variables τ^{0},…,τ^{i–1} has already been marginalized out: the latent cognitive process is a Markov chain in the tonal space ℑ and the marginal distribution p(τ^{i}|c,w) at step i of the process can be recursively computed via dynamic programming. The path length n can be marginalized out by interleaving the dynamic programming updates with weighted increments to the final distribution p(t|c,w,λ), resulting in the procedure shown in Algorithm 1. Being able to compute the marginal likelihood p(t|c,w,λ) allows us to optimize the piece-level variables c, w and λ to find their MAP estimate.
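For reference, the MAP objective and the recursive structure described above can be written out as follows. This is a reconstruction from the definitions given in the text, assuming that tones are conditionally independent given β; the notation may differ in detail from the original equations.

```latex
\begin{align}
\beta^{*} &= \operatorname*{arg\,max}_{\beta}\; p(\beta;\alpha) \prod_{t \in d} p(t \mid \beta), \\
p(t \mid c, w, \lambda) &= \sum_{n=0}^{\infty} p(n \mid \lambda)\; p(\tau^{n} = t \mid c, w), \\
p(\tau^{0} \mid c, w) &= p(\tau^{0} \mid c), \\
p(\tau^{i} \mid c, w) &= \sum_{\tau^{i-1}} p(\tau^{i} \mid \tau^{i-1}, w)\; p(\tau^{i-1} \mid c, w), \qquad i \geq 1.
\end{align}
```

The last two lines are the dynamic programming recursion over the Markov chain, and the second line is the weighted accumulation over path lengths.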
Algorithm 1
Input: c, w, λ
Output: p(t|c,w,λ)
u ← 0                       # initialize output distribution
v ← p(τ^{0}|c)              # initialize intermediate distribution
M ← p(τ′|τ,w)               # initialize transition matrix
for n ∈ {0,1,…} do
    u ← u + p(n|λ) · v      # update, marginalizing out n
    v ← Mv                  # transition, marginalizing out τ^{i}
end for
return u
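Algorithm 1 can be sketched in plain Python as follows. The sketch is not the original PyTorch implementation and rests on several assumptions: a single tonal origin, a binomial path-length distribution with an integer number of trials (so that the loop over n terminates exactly), the primary intervals encoded as signed line-of-fifths steps, and sparse dictionaries in place of the explicit transition matrix M. All names are hypothetical.

```python
from collections import defaultdict
from math import comb

# Assumed encoding of the six primary intervals as line-of-fifths steps.
PRIMARY_STEPS = {"+P5": 1, "-P5": -1, "+M3": 4, "-M3": -4, "+m3": -3, "-m3": 3}


def binomial_pmf(n, n_trials, p_step):
    """Probability of a path of length n under a binomial path-length model."""
    return comb(n_trials, n) * p_step**n * (1 - p_step) ** (n_trials - n)


def tonal_diffusion(origin, weights, n_trials, p_step):
    """Marginal distribution over tones, following the structure of Algorithm 1."""
    u = defaultdict(float)  # output distribution p(t | c, w, lambda)
    v = {origin: 1.0}       # intermediate distribution p(tau^i | c, w)
    for n in range(n_trials + 1):
        # Update: add the current distribution, weighted by p(n | lambda).
        p_n = binomial_pmf(n, n_trials, p_step)
        for tone, prob in v.items():
            u[tone] += p_n * prob
        # Transition: one more interval step, marginalizing out tau^i.
        nxt = defaultdict(float)
        for tone, prob in v.items():
            for name, weight in weights.items():
                nxt[tone + PRIMARY_STEPS[name]] += weight * prob
        v = nxt
    return dict(u)
```

For a Poisson path-length distribution the loop would have to be truncated at some maximum n, which introduces an approximation error that depends on the remaining probability mass of p(n|λ).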
For our evaluation, we make three additional assumptions that are specific to tonal music and well-established in music theory (see Section 3): 1) we assume that only a single tonal origin per piece exists and that all tones are a priori equally likely to become the tonal origin; 2) we restrict the number of allowed interval steps to the six primary intervals present in the Tonnetz; 3) we assume a uniform prior over the path-length variable λ. These assumptions facilitate inference (see Appendix D for technical details) and imply that computing a MAP estimate becomes equivalent to a maximum likelihood (ML) estimate.
We infer values for the piece-level variables β by maximizing their posterior probability or (equivalently, due to using uniform priors) minimizing the negative data log-likelihood, cross-entropy, or Kullback-Leibler divergence (KLD). The model was implemented in PyTorch (Paszke et al., 2019) and optimization was done in two nested procedures: 1) The values for w and λ were optimized via gradient descent using the Adam optimizer (Kingma and Ba, 2015). 2) For specific values of the weight and path-length variables, w and λ, we chose the tonal origin c that minimizes the KLD (gradients are propagated through this step using PyTorch’s automatic differentiation functionality).
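The inner selection step can be illustrated with a short Python sketch that computes the KLD between an empirical PCD and a model distribution and picks the best tonal origin from a list of candidates. It covers only the discrete selection, not the gradient-based optimization of w and λ, and it assumes a single tonal origin, distributions stored as dictionaries indexed by line-of-fifths position, and hypothetical function names.

```python
import math


def kld(empirical, model):
    """Kullback-Leibler divergence KL(empirical || model) for dict-based PCDs."""
    total = 0.0
    for tone, p in empirical.items():
        if p == 0.0:
            continue  # terms with zero empirical mass contribute nothing
        q = model.get(tone, 0.0)
        if q == 0.0:
            return math.inf  # model assigns no mass to an observed tone
        total += p * math.log(p / q)
    return total


def best_origin(empirical, model_at_zero, candidates):
    """Pick the tonal origin whose shifted model distribution minimizes the KLD."""

    def shifted(origin):
        # With a single origin and additive steps, moving the origin only
        # translates the model distribution along the line of fifths.
        return {tone + origin: q for tone, q in model_at_zero.items()}

    return min(candidates, key=lambda c: kld(empirical, shifted(c)))
```

Because the model distribution is merely translated when the origin changes, it needs to be computed only once for a reference origin and can then be compared against the empirical PCD at every candidate position.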
We evaluate the TDM in two ways: first, we introduce several baseline models (Section 5.1) and perform a quantitative comparison (Section 5.2) on a corpus of 248 pieces from different historical epochs. Second, we perform a detailed qualitative analysis (Section 5.3) on three exemplary pieces, inspecting the inferred parameters and discussing our musical interpretation of these results.^{1}
We introduce several baseline models to verify whether the structural assumptions incorporated in our model effectively improve performance. In particular, we are interested in validating the impact of the Tonnetz topology and the assumed latent process on model performance.
The two static baseline models, Static (1 Profile) and Static (2 Profiles), do not incorporate any topological information except for transposition invariance. These models also lack the semantic interpretability aimed for with the TDM and most closely resemble the categorical representation of PCDs, as used in template-based key finders or the model of Hu and Saul (2009b). The line-of-fifths model, TDM (Binomial, 1D), is a reduced version of the TDM that uses the line of fifths but not the full Tonnetz topology. To evaluate the influence of the path-length distribution we also include TDM (Poisson), a version of the full TDM with a Poisson path-length distribution, in addition to the best-performing TDM (Binomial) with a binomial path-length distribution.
The static baseline model consists of a single global PCD, that is, a fixed template or profile. The model has an individual parameter for each tonal pitch class; together these parameters determine the shape of the profile and they are trained via gradient descent on the entire corpus. The profile is individually matched to each piece (during training and for evaluation) by shifting it along the line of fifths. The model thus has a large number of continuous corpus-level parameters (one for each tonal pitch class) but only a single discrete piece-specific variable (the transposition). The optimal profile corresponds to the purple line in Figures 6a–6c.
This model is identical to the static model described above but comprises two different static PCDs. Matching with a specific piece is done by choosing the best profile and transposition. It thus has two discrete piece-specific variables, the transposition (as above) and the binary profile class. This model is conceptually very similar to conventional template-based key-finding algorithms with two profiles, which are usually assumed to correspond to the major and minor modes – a questionable assumption that will be discussed below. The optimal profiles correspond to the blue line in Figure 6a (first profile) and Figure 6b and 6c (second profile).
To specifically test the impact of the Tonnetz topology, we introduce a reduced version of the TDM that uses only the line-of-fifths topology. This model is identical to the full TDM (with binomial path-length distribution), except that it only uses fifth steps (no major or minor thirds). The model has three piece-specific variables (one for weighting +P5 against –P5 and two for the binomial distribution) and produces bell-shaped PCDs on the line of fifths, including Gaussian and skewed bell shapes.
For the quantitative evaluation of the TDM, we used a corpus of 248 pieces that are representative of the Baroque, Classical, and Romantic eras: all preludes and fugues from Bach’s Well-Tempered Clavier Vol. I & II (96 pieces), all movements from Beethoven’s piano sonatas (101 pieces), and 51 pieces by Liszt, including etudes, pieces from the Années de Pèlerinage, and the B-minor Sonata.
All models were trained on the entire corpus to minimize the Kullback-Leibler divergence (KLD) between the model and empirical PCD (or, equivalently, minimize the cross-entropy, or maximize the data likelihood). To train the two static models with corpus-level parameters, the three composers were weighted equally: each piece was weighted inversely proportional to the total number of pieces of the respective composer. The results are shown in Figure 5.
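The composer weighting can be illustrated with a few lines of Python. The sketch assumes a proportionality constant of 1, so that the weights of each composer's pieces sum to one; the function name and the list-based corpus representation are hypothetical.

```python
from collections import Counter


def piece_weights(composers):
    """Weight each piece by 1 / (number of pieces by its composer)."""
    counts = Counter(composers)
    return [1.0 / counts[name] for name in composers]


# Corpus sizes as reported in the text: 96 Bach, 101 Beethoven, 51 Liszt pieces.
corpus = ["Bach"] * 96 + ["Beethoven"] * 101 + ["Liszt"] * 51
weights = piece_weights(corpus)

# Total weight per composer is the same, regardless of the number of pieces.
weight_per_composer = {
    name: sum(w for c, w in zip(corpus, weights) if c == name)
    for name in set(corpus)
}
```

With this weighting a single Liszt piece counts roughly twice as much as a single Beethoven piece in the training loss, which compensates for the smaller number of Liszt pieces in the corpus.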
As expected, the static model with a single global PCD performs worst. The inferred profile can be seen in the detailed analysis of the single pieces in Figure 6. It is bimodal and can be understood as a combined major-minor profile with additional long tails along the line of fifths. For pieces in major mode, the lower mode coincides with the tonic and dominant pitch class, while the upper mode covers the major third. For pieces in minor mode the reverse is true, with the upper mode corresponding to the tonic and dominant, while the lower mode covers the minor third. This single profile can be interpreted as the best attempt to match all pieces in the corpus to a single profile.
Also not surprisingly, the static model with two profiles performs considerably better. Notably, for the Bach pieces this model performs as well as the TDM (Poisson) and almost as well as the TDM (Binomial). Presumably, the main reason for the strong performance of the static model on Bach’s pieces is the fact that these pieces are typically confined to a relatively well-defined range on the line of fifths, after which the PCD quickly drops to zero. This shape does not correspond to the smooth decay assumed in the TDM, which is more typically observed in the pieces by Beethoven. In contrast to the relatively good performance for Bach’s pieces, for Beethoven and Liszt the static two-profile model performs even worse than the reduced line-of-fifths TDM (Binomial, 1D).
Inspecting the two inferred profiles in Figure 6 reveals an interesting property: instead of learning a major and a minor profile (as one might have expected), one of the profiles is almost identical to the compound major-minor profile of the single-profile model (just with shorter tails), while the second profile has three modes that form an augmented triad. The numbers of pieces associated with the respective profiles are 96/0 (Bach), 88/13 (Beethoven), and 8/43 (Liszt). This suggests that reducing the tonality of the Classical and Romantic era to only a major and a minor mode might not always be appropriate. The profiles could rather be interpreted as a ‘conventional diatonic’ profile and an ‘extended tonality’ profile. In fact, when trained on only the Bach pieces (results not included), the static two-profile model learns a major and minor profile, outperforms the TDM, and displays a key classification accuracy of 97%. Our results thus question established key classifiers by showing that their accuracy is very high only in a relatively narrow musical repertoire. In conclusion, the static models strongly overfit on the corpus being used, which may or may not be acceptable, depending on the application. In contrast, note that the TDM has more piece-specific degrees of freedom than the static model but no corpus-level parameters; it is thus immune to this kind of overfitting on the corpus level.
Finally, we compare the performance of the different versions of the TDM. For the line-of-fifths based TDM (Binomial, 1D), we observe a decreasing performance from Bach to Beethoven and Liszt. This corresponds to the music-theoretic insight that Bach’s harmony is predominantly fifths-based, while Beethoven incorporates an increasing amount of third-based harmonic progressions, which is yet again extended by Liszt. Beethoven’s pieces therefore tend to be multimodal on the line of fifths – but not on the Tonnetz. In contrast, some of Liszt’s pieces are multimodal even on the Tonnetz and tend to be fragmented on the line of fifths.
This is also strongly reflected in the different performances of the full TDMs, which both show the best results for Beethoven. However, the reasons for the decrease in performance for Bach and Liszt as compared to Beethoven are presumably different.
As mentioned above, Bach’s pieces tend to have PCDs with a relatively sharp decay, which conflicts with the smooth decay of the TDM, more commonly found in Beethoven’s pieces. On the other hand, the fact that Liszt’s pieces tend to be multimodal even on the Tonnetz (due to the extended tonality) means that, even though they have a smooth decay, they may not be well captured by the TDM. A slight modification of the TDM to allow for multiple tonal origins might make it possible to model this kind of extended tonality appropriately.
We chose three exemplary pieces with the goal of evaluating whether the TDM is able to capture characteristics of the respective piece and historical period. We used the same three pieces for which the empirical PCDs are shown in Figure 3 (by Bach, Beethoven, and Liszt).
As described above, the piece-level variables were determined via MAP/ML estimation. The results are shown in Figure 6 with the values in brackets indicating the KLD between the empirical and model distributions.
For each of the three example pieces, the MAP estimates for the parameters of the binomial TDM are shown in Figure 7. The parameters µ and σ, given in the caption, are the mean and standard deviation of the path-length distribution. The probabilities p_{i} for choosing the different intervals in a particular step are indicated in the box at the bottom right. The tonal origin is displayed in the center of each plot and the lengths of the arrows (also given in their labels) indicate the expected numbers of steps µp_{i} in the six primary interval directions.
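The arrow lengths described above follow directly from the product µp_{i}. The sketch below illustrates this relation in Python; the function name `expected_steps`, the value of `mu`, and the six interval weights are hypothetical and chosen only for illustration, not estimates for any of the three pieces.

```python
def expected_steps(mu, weights):
    """Return the expected number of steps mu * p_i for each interval.

    `mu` is the mean of the path-length distribution and `weights` maps
    interval labels to the probabilities p_i, which must sum to one.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("interval weights must sum to one")
    return {interval: mu * p for interval, p in weights.items()}

# Hypothetical parameters (illustrative only).
mu = 3.0
weights = {
    "+P5": 0.30, "-P5": 0.40,
    "+M3": 0.10, "-M3": 0.05,
    "+m3": 0.05, "-m3": 0.10,
}

arrows = expected_steps(mu, weights)
for interval, length in arrows.items():
    print(f"{interval}: {length:.2f}")
```

Because the weights sum to one, the arrow lengths sum to µ, so the plot conveys both the overall path length and how it is divided among the six directions.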
For Bach’s prelude (Figures 3a, 6a and 7a), the model was able to capture the PCD mainly extending along the line of fifths as reflected in the high weights for both perfect fifth intervals +P5 and –P5. The stronger weight for the descending fifth (p_{–P5} = 0.475) reflects that the tonal origin G lies a fifth above the most frequent tone and global tonic C. The strong descending fifth is balanced by the combination of the ascending fifth (p_{+P5} = 0.288) and the descending minor third that also ascends along the line of fifths (p_{–m3} = 0.137). Interestingly, explaining the PCD using the global tonic C as the tonal origin (results not shown) is only marginally less accurate with correspondingly changed weights (stronger ascending fifth and a strong ascending major third instead of the descending minor third). Given the lack of temporal information, this ambiguity is musically consistent and reflects the harmonically close relation of G and C (also see Section 3.4). The general importance of the line of fifths for the organization of the tonal material is characteristic of Baroque pieces.
The optimal parameters for Beethoven’s Sonata movement (Figures 3b, 6b and 7b) also prioritize the two perfect fifth directions, almost in a symmetrical manner (p_{+P5} = 0.331, p_{–P5} = 0.356). However, as opposed to Bach’s piece, a significant amount of the probability mass (≈30%) is assigned to the two major third components (p_{+M3} = 0.160, p_{–M3} = 0.137), again essentially symmetric. This is typical for pieces in the minor mode due to the above-mentioned overall prominence of the (minor) tonic and (major) dominant triads. However, it can also be interpreted as reflecting the stylistic changes in the Classical period that allow for broader ranges of mediantic (i.e. third-based) local key relations, as can also be observed in this movement. The approximate point symmetry of the empirical distribution around the pitch class A and along the three axes of the Tonnetz (see Figure 3b) is reflected in similar weights of the ascending and descending intervals along the same axis, as shown in Figure 6b. Note that A, the point of symmetry and the model’s choice for the tonal origin, in fact is the fifth of the tonic D and the root of the dominant triad. Again, this exemplifies that the tonal origin chosen by the TDM does not need to be the tonic but rather incorporates a more general notion of intervallic relations.
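The symmetry argument can be checked with simple arithmetic on the four weights reported above. The following Python sketch groups them by Tonnetz axis; the helper `axis_summary` is a hypothetical name, and the residual attributed to the minor third axis is only implied under the assumption that the six interval weights sum to one.

```python
# Interval weights reported in the text for the Beethoven movement.
reported = {"+P5": 0.331, "-P5": 0.356, "+M3": 0.160, "-M3": 0.137}

def axis_summary(weights):
    """Return total mass and ascending-minus-descending difference per axis.

    Axes for which one of the two directions is missing are skipped.
    """
    summary = {}
    for axis in ("P5", "M3", "m3"):
        up = weights.get("+" + axis)
        down = weights.get("-" + axis)
        if up is None or down is None:
            continue
        summary[axis] = {"mass": up + down, "asymmetry": up - down}
    return summary

summary = axis_summary(reported)

# Mass left for the two minor third directions, assuming the six weights
# sum to one (this residual is implied, not reported in the text).
residual = 1.0 - sum(reported.values())

for axis, stats in summary.items():
    print(axis, round(stats["mass"], 3), round(stats["asymmetry"], 3))
print("implied m3 mass:", round(residual, 3))
```

The major third axis carries a mass of 0.297, which matches the stated ≈30%, and the ascending and descending weights differ by less than 0.03 on both reported axes.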
The most diverse distribution of pitch classes is found in Liszt’s piano piece (Figures 3c, 6c and 7c). The empirical distribution in Figure 6c is clearly multimodal on the line of fifths around F♯/C♯, D♯/A♯, and D/A with another smaller mode around F/B♭, likewise pointing towards mediantic relations as in Beethoven’s piece. But unlike the Sonata, the distribution is asymmetric. The extended harmonic relations in Liszt’s piece are reflected in the model outcome, which assigns non-zero probability mass to all six primary intervals, with an acceptable overall fit of the estimated distribution (KLD=0.061) as compared to the two previous pieces and the range of KLD values in Figure 5.
The model was able to capture a number of important characteristics of this piece. The harmonic structure of the entire piece is fundamentally governed by major third relations. Its three sections (Moderato; Andante; and Più sostenuto, quasi Preludio) stand in the keys F♯ major, D major, and B♭ major, respectively. The major-third relation between those keys also permeates each of the sections on more local levels (e.g. D major and B♭ major passages in the F♯ major sections) and thus governs the harmonic structure of the piece on several hierarchical levels.
Since each of the sections is largely diatonic with some ornamental chromaticism, it is not surprising that also for this Romantic piece the perfect fifths together account for more than 50% of the overall weights, followed by the descending minor and major thirds (0.216 and 0.192, respectively). This entails, for example, that the upper major third A♯ of the frequently used F♯ major triad is largely explained by the combination of an ascending fifth and a descending minor third, while D and B♭ are more directly explained as descending major third steps. In particular, the difference of the model explanations between B♭ and A♯ is meaningful since these tonal pitch classes bear different harmonic implications, which would be lost in a neutral pitch-class representation.
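The distinction between A♯ and B♭ can be made concrete by representing tonal pitch classes as integers on the line of fifths, so that each Tonnetz interval becomes a fixed displacement. The Python sketch below assumes the indexing C = 0 with one unit per ascending fifth; the helper names `spell`, `walk`, and `chroma` are hypothetical and not part of the published code.

```python
# Displacement of each primary interval on the line of fifths:
# a major third up equals four fifths up, a minor third up equals three fifths down.
FIFTHS = {"+P5": 1, "-P5": -1, "+M3": 4, "-M3": -4, "+m3": -3, "-m3": 3}

def spell(position):
    """Return the spelled name of a line-of-fifths position (C = 0, G = 1, F = -1)."""
    index = position + 1                 # shift so that F is index 0
    letter = "FCGDAEB"[index % 7]
    accidentals = index // 7             # positive: sharps, negative: flats
    sign = "♯" if accidentals > 0 else "♭"
    return letter + sign * abs(accidentals)

def walk(origin, steps):
    """Apply a sequence of interval steps to a line-of-fifths position."""
    return origin + sum(FIFTHS[step] for step in steps)

def chroma(position):
    """Map a line-of-fifths position to a chromatic pitch class (C = 0)."""
    return (7 * position) % 12

f_sharp = 6
a_sharp = walk(f_sharp, ["+P5", "-m3"])   # ascending fifth, descending minor third
d = walk(f_sharp, ["-M3"])                # one descending major third
b_flat = walk(f_sharp, ["-M3", "-M3"])    # two descending major thirds

print(spell(a_sharp), spell(d), spell(b_flat))
print(chroma(a_sharp) == chroma(b_flat))
```

In this representation A♯ and B♭ lie twelve fifths apart and are reached by different paths from F♯, although both map to the same chromatic pitch class.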
We presented the Tonal Diffusion Model (TDM), a generative probabilistic model for pitch-class distributions (PCDs) that incorporates relevant music-theoretic and cognitive insights. As opposed to most existing models for PCDs, the TDM incorporates algebraic and geometric structures from music theory and describes the generation of tones in a piece as a latent cognitive process. It thereby relates the statistical properties of PCDs to musically meaningful and interpretable variables.
The model was evaluated quantitatively on a corpus of 248 pieces, on which it showed superior performance compared to traditional models. Comparing against several baseline models, the positive impact of incorporating the Tonnetz structure was demonstrated. Furthermore, a detailed analysis of three exemplary pieces showed that the TDM is able to capture characteristic properties of these pieces and the respective period.
The TDM is well-suited to study a range of relevant questions in digital musicology and MIR. It may be extended and adapted in multiple ways. For instance, it can be adapted to incorporate more specific assumptions about the underlying cognitive processes and the relevant musical style, and it allows for corpus-based studies that investigate historical developments and stylistic differences among a large number of musical pieces. In MIR, the piece-level variables (tonal origin, interval weights, and path-length distribution) can be conceived as a tonal fingerprint of the piece, going beyond the notion of a pitch profile and thus allowing its tonality to be determined in a novel way. The model can be extended to include a larger (or infinite) set of intervals ℐ by employing a full Dirichlet process prior, and the modeled cognitive process can be adapted by using independent path-length distributions for the different intervals. The tonal space can be augmented to include tuning differences and to take into account different interpretations for different generation paths. Finally, the model may be generalized to include a time component that takes sequential and syntactic dependencies in musical pieces into account.
The TDM thus provides a novel approach to modeling PCDs in a compact and musically interpretable way, while outperforming existing approaches in terms of accuracy. It may thereby serve as a broad foundation for further developing generative models of PCDs in music and opens up multiple highly promising directions for future research.
The additional file for this article can be found as follows:
Supplementary Material: Appendix. DOI: https://doi.org/10.5334/tismir.46.s1
^{1}The data and code to reproduce our results can be found at https://github.com/DCMLab/tonal-diffusion-model.
This project was partially funded through the Swiss National Science Foundation within the project “Distant Listening – The Development of Harmony over Three Centuries (1700–2000)”. Also, this project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program under grant agreement No 760081 – PMSB. Martin Rohrmeier acknowledges the kind support by Mr Claude Latour through the Latour chair of Digital Musicology at EPFL.
The authors have no competing interests to declare.
Aldwell, E., Schachter, C., & Cadwallader, A. (2010). Harmony and Voice Leading. Cengage Learning, 4th edition.
Bigo, L., & Andreatta, M. (2016). Topological Structures in Computer-Aided Music Analysis. In Meredith, D., editor, Computational Music Analysis, pages 57–80. Springer, Berlin. DOI: https://doi.org/10.1007/978-3-319-25931-4_3
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993–1022.
Chew, E. (2000). Towards a Mathematical Model of Tonality. Doctoral dissertation, Massachusetts Institute of Technology, Cambridge, MA.
Clough, J., & Myerson, G. (1985). Variety and multiplicity in diatonic systems. Journal of Music Theory, 29(2), 249–270. DOI: https://doi.org/10.2307/843615
Cohn, R. (1997). Neo-Riemannian operations, parsimonious trichords, and their “Tonnetz” representations. Journal of Music Theory, 41(1), 1–66. DOI: https://doi.org/10.2307/843761
Cohn, R. (2012). Audacious Euphony: Chromatic Harmony and the Triad’s Second Nature. Oxford University Press, Oxford. DOI: https://doi.org/10.1093/acprof:oso/9780199772698.001.0001
Douthett, J., & Steinbach, P. (1998). Parsimonious graphs: A study in parsimony, contextual transformations and modes of limited transposition. Journal of Music Theory, 42(2), 241–263. DOI: https://doi.org/10.2307/843877
Euler, L. (1739). Tentamen novae theoriae musicae ex certissimis harmoniae principiis dilucide expositae. Ex Typographia Academiae Scientiarum, St. Petersburg.
Fiore, T. M., & Noll, T. (2011). Commuting groups and the topos of triads. In Agon, C., Amiot, E., Andreatta, M., Assayag, G., Bresson, J., & Manderau, J., editors, Mathematics and Computation in Music, volume 6726 of Lecture Notes in Artificial Intelligence. Springer, Berlin. DOI: https://doi.org/10.1007/978-3-642-21590-2_6
Gárdonyi, Z., & Nordhoff, H. (2002). Harmonik. Möseler Verlag, Wolfenbüttel.
Griffiths, T., Steyvers, M., Blei, D., & Tenenbaum, J. (2005). Integrating topics and syntax. Advances in Neural Information Processing Systems, 17, 537–544.
Haas, B. (2004). Die neue Tonalität von Schubert bis Webern: Hören und Analysieren nach Albert Simon. Florian Noetzel, Wilhelmshaven.
Harasim, D., Noll, T., & Rohrmeier, M. (2019). Distant neighbors and interscalar contiguities. In Montiel, M., Gomez-Martin, F., & Agustín-Aquino, O. A., editors, Mathematics and Computation in Music, Lecture Notes in Computer Science, pages 172–184. Springer International Publishing. DOI: https://doi.org/10.1007/978-3-030-21392-3_14
Harasim, D., Schmidt, S. E., & Rohrmeier, M. (2016). Bridging scale theory and geometrical approaches to harmony: The voice-leading duality between complementary chords. Journal of Mathematics and Music, 10(3), 193–209. DOI: https://doi.org/10.1080/17459737.2016.1216186
Hostinský, O. (1879). Die Lehre von den musikalischen Klängen: Ein Beitrag zur aesthetischen Begründung der Harmonielehre. H. Dominicus, Prague.
Hu, D. J., & Saul, L. K. (2009a). A probabilistic topic model for music analysis. In 22nd Conference on Neural Information Processing Systems, Workshop on Applications for Topic Models: Text and Beyond.
Hu, D. J., & Saul, L. K. (2009b). A probabilistic topic model for unsupervised learning of musical key-profiles. In Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR 2009), pages 441–446.
Huron, D., & Veltman, J. (2006). A cognitive approach to medieval mode: Evidence for an historical antecedent to the major/minor system. Empirical Musicology Review, 1(1). DOI: https://doi.org/10.18061/1811/24072
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Bengio, Y., & LeCun, Y., editors, 3rd International Conference on Learning Representations.
Koelsch, S., Rohrmeier, M., Torrecuso, R., & Jentschke, S. (2013). Processing of hierarchical syntactic structure in music. Proceedings of the National Academy of Sciences of the United States of America, 110(38), 15443–8. DOI: https://doi.org/10.1073/pnas.1300272110
Krumhansl, C. L. (1990). Cognitive Foundations of Musical Pitch. Oxford University Press, New York.
Krumhansl, C. L. (1998). Perceived triad distance: Evidence supporting the psychological reality of neo-Riemannian transformations. Journal of Music Theory, 42(2), 265–281. DOI: https://doi.org/10.2307/843878
Krumhansl, C. L., & Kessler, E. J. (1982). Tracing the dynamic changes in perceived tonal organization in a spatial representation of musical keys. Psychological Review, 89(4), 334–368. DOI: https://doi.org/10.1037/0033-295X.89.4.334
Lerdahl, F., & Jackendoff, R. S. (1983). A Generative Theory of Tonal Music. MIT Press, Cambridge, MA.
Milne, A. J., & Holland, S. (2016). Empirically testing Tonnetz, voice-leading, and spectral models of perceived triadic distance. Journal of Mathematics and Music, 10(1), 59–85. DOI: https://doi.org/10.1080/17459737.2016.1152517
Minka, T., & Winn, J. (2009). Gates. In Advances in Neural Information Processing Systems, pages 1073–1080.
Moss, F. C. (2019). Transitions of Tonality: A Model-Based Corpus Study. Doctoral dissertation, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.
Moss, F. C., Loayza, T., & Rohrmeier, M. (2019). pitchplots. DOI: https://doi.org/10.5281/zenodo.3265393
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., & Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., & Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
Rohrmeier, M. (2011). Towards a generative syntax of tonal harmony. Journal of Mathematics and Music, 5(1), 35–53. DOI: https://doi.org/10.1080/17459737.2011.573676
Rohrmeier, M. (2020). The syntax of jazz harmony: Diatonic tonality, phrase structure, and form. Music Theory and Analysis (MTA), 7(1), 1–63. DOI: https://doi.org/10.11116/MTA.7.1.1
Schenker, H. (1935). Der freie Satz. Universal Edition, Wien.
Schoenberg, A. (1969). Structural Functions of Harmony. Norton, New York.
Selfridge-Field, E., editor (1997). Beyond MIDI: The Handbook of Musical Codes. MIT Press, Cambridge, MA.
Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. In Landauer, T. K., McNamara, D. S., Dennis, S., & Kintsch, W., editors, Handbook of Latent Semantic Analysis, pages 424–440. Lawrence Erlbaum Associates, Mahwah, NJ.
Temperley, D. (2000). The line of fifths. Music Analysis, 19(3), 289–319. DOI: https://doi.org/10.1111/1468-2249.00122
Toiviainen, P., & Krumhansl, C. L. (2003). Measuring and modeling real-time responses to music: The dynamics of tonality induction. Perception, 32(6), 741–766. DOI: https://doi.org/10.1068/p3312
Tymoczko, D. (2011). A Geometry of Music: Harmony and Counterpoint in the Extended Common Practice. Oxford University Press, Oxford.
von Oettingen, A. (1866). Harmoniesystem in dualer Entwicklung. W. Gläser, Dorpat und Leipzig.
Weber, G. (1851). The Theory of Musical Composition, Treated with a View to a Naturally Consecutive Arrangement of Topics. Messrs Robert Cocks and Co., London.