The Tonal Diffusion Model

Pitch-class distributions are of central relevance in music information retrieval, computational musicology and various other fields, such as music perception and cognition. However, despite their structure being closely related to the cognitively and musically relevant properties of a piece, many existing approaches treat pitch-class distributions as fixed templates. In this paper, we introduce the Tonal Diffusion Model, which provides a more structured and interpretable statistical model of pitch-class distributions by incorporating geometric and algebraic structures known from music theory as well as insights from music cognition. Our model explains the pitch-class distributions of musical pieces by assuming tones to be generated through a latent cognitive process on the Tonnetz, a well-established representation for harmonic relations. Specifically, we assume that all tones in a piece are generated by taking a sequence of interval steps on the Tonnetz starting from a unique tonal origin. We provide a description in terms of a Bayesian generative model and show how the latent variables and parameters can be efficiently inferred. The model is quantitatively evaluated on a corpus of 248 pieces from the Baroque, Classical, and Romantic era and describes the empirical pitch-class distributions more accurately than conventional template-based models. On three concrete musical examples, we demonstrate that our model captures relevant harmonic characteristics of the pieces in a compact and interpretable way, also reflecting stylistic aspects of the respective epoch.


Introduction
Pitch-class distributions (PCDs) are fundamental objects of study in many fields of empirical music research, from music information retrieval (MIR), mathematical music theory, and computational musicology to music perception, cognition, and corpus studies (Huron and Veltman, 2006). They are closely related to relevant cognitive and musical properties of the pieces, such as their key, degree of consonance and dissonance, and characteristics of the modulation plan. A PCD may thus be understood as a fingerprint of a piece that contains important information about its tonality.
However, despite the meaningful structure of PCDs, many existing approaches do not make the relevant aspects explicit in the way PCDs are modeled and represented. In fact, PCDs are mostly treated as fixed templates, whose structure is only interpreted in a post-hoc manner. This is also reflected in the way PCDs are commonly visualized, namely as categorical distributions of twelve chromatic semitones, where visual proximity does not necessarily reflect musical relations. Ignoring music-theoretical insights about the structure of tonal space for representing and visualizing PCDs conceals deeper structural relations that are reflected in these distributions.
The goal of this paper therefore is to provide a compact, structured and interpretable representation of PCDs. The proposed Tonal Diffusion Model (TDM) achieves this in two ways: first, by leveraging relevant music-theoretic and cognitive insights, specifically, the algebraic representation of tonal space called the Tonnetz (Hostinský, 1879; Cohn, 1997); and second, by defining an explicit statistical model, which independently generates all tones in a piece by starting at a piece-specific tonal origin and taking steps in different directions on the Tonnetz.
We quantitatively evaluate our model against several baseline models on a corpus of 248 pieces from the Baroque, Classical, and Romantic era, demonstrating that the TDM provides music-theoretically reasonable interpretations. Moreover, it models the empirical PCDs more accurately than the unstructured baseline models as measured by the Kullback-Leibler divergence (KLD) to the empirical PCD.

Representations of Pitch-Class Distributions
Many large corpora are available in the MIDI format, which encodes pitch through an integer representation, commonly interpreted as twelve-tone equal temperament (12TET) (Selfridge-Field, 1997). MIDI thus has the major drawback that it does not preserve the musically relevant distinctions between enharmonically equivalent pitch classes that are expressed in pitch spelling. This simplified representation has largely carried over to PCDs, which are commonly visualized in a linear chromatic arrangement.
This implies three shortcomings (see Appendix A for a more detailed discussion): 1) the visual representation in a linear arrangement suggests an implicit ordering and proximity relation that is not explicitly modeled; 2) the commonly used chromatically ascending order does not reflect tonal relations well, especially with respect to harmony and key, where an arrangement along the line or circle of fifths would be better suited; and 3) the space of pitch classes exhibits an inherent cyclic topology, which is not reflected in a linear arrangement. The TDM overcomes these limitations by using the Tonnetz as the basis representation for PCDs, which preserves pitch spelling, and by representing proximity relations via steps on the Tonnetz. Moreover, the Tonnetz extends the purely fifth-based representation by additionally allowing for proximity relations based on major and minor third steps.

Empirical Evidence for the Tonnetz
The pervasive categorical representation of PCDs is also used to display the results of psychological probe-tone experiments where participants rate different degrees of stability of tones in 12TET (Krumhansl and Kessler, 1982). The underlying assumption in these experiments is that enculturated listeners have acquired mental representations of tonality through exposure to music via statistical learning, which approximately corresponds to inferring the note distributions in musical corpora (Krumhansl, 1990; Huron and Veltman, 2006).
Several studies have shown that PCDs suggest an arrangement that is essentially equivalent to the Tonnetz (see Section 3.2). Krumhansl and Kessler (1982) use multidimensional scaling (MDS) to find a toroidal structure of the 24 key profiles, obtained by transposing the major and minor profiles to all twelve keys. The appropriateness to model tonal space as a torus has a long-standing tradition in music theory and is supported by these empirical findings.
The cognitive relevance of this structure has been further investigated by Krumhansl (1998), where the Tonnetz is used to model the perceived distance of triads. Toiviainen and Krumhansl (2003) use a self-organizing map, which essentially reproduces the results of the MDS. Their study extends prior research by modeling listeners' dynamic perception of tonality in a musical piece over time, which is reflected in high activation of certain regions on the Tonnetz. Milne and Holland (2016) compare three different models for perceived triadic distance, namely the psychoacoustic measure of spectral pitch-class distance, distance on the Tonnetz, and voice-leading distance between the triads. They conclude that the Tonnetz is highly predictive for empirical data of triadic similarity and correlates strongly with the psychoacoustic measure.
Our approach builds upon these prior works on the cognitive relevance of the Tonnetz by using it more generally to represent empirical PCDs.

Topic Models
The basic structure of the TDM (see Section 4) is similar to that of probabilistic topic models (Blei et al., 2003; Steyvers and Griffiths, 2007), which have successfully been employed for the prediction of word frequencies in natural language. Topic models represent a corpus as a bag (an unordered collection) of documents (e.g. texts or pieces) and each document as a bag of items (e.g. words or notes). However, language models do not incorporate the kind of structure present in music, which can be expressed in algebraic (Fiore and Noll, 2011) and geometric (Tymoczko, 2011) models. Most topic models for language assume no relation between words and topics besides their probability of co-occurrence, even though there are some attempts to overcome this limitation (Griffiths et al., 2005).
A topic model for music should therefore go beyond those developed for natural language by explicitly modeling the structural relations of pitch classes in tonal music and incorporating the geometric and algebraic structures known from music theory as well as insights from music cognition.
The only music-specific application of topic models we are aware of is by Hu and Saul (2009a,b). However, while sharing the general topic model structure with our TDM, there are two profound differences (see Appendix B for a more detailed discussion): 1) Hu and Saul address the problem of key finding and learn two static key profiles in twelve enharmonically equivalent pitch classes, whereas our model aims to provide a more fine-grained representation of tonality that goes beyond a binary major/minor classification, also taking into account pitch spelling. 2) Hu and Saul infer separate key labels for different sections of each piece, while our goal is to compactly represent the global harmonic character of the entire piece.
Consequently, while Hu and Saul evaluate their model by comparing its classification accuracy to that of established template-based key finders, we use the KLD to the empirical distribution as a performance measure, which makes a direct comparison of the present approach with that of Hu and Saul difficult. Note, however, that our baseline model with two static key profiles is conceptually very similar to the model by Hu and Saul.

A Cognitive Interpretation of the Tonnetz
Based on the music-theoretical considerations discussed above, we present a cognitive interpretation of the Tonnetz and describe a model of the latent cognitive process, grounded in the perception of tonal music.

Representation of Tones
We represent tones on the line of fifths (Temperley, 2000). That is, the tonal space T is the set of tones generated by taking an arbitrary number of steps from a given reference tone by either descending or ascending a perfect fifth (see Figure 1). Assuming octave equivalence, tonal pitch classes (TPCs), used for labeling notes in Western music, have a one-to-one mapping to the line of fifths (second row in Figure 1). The set of intervals I , describing the distance between two tones, is the set of directed TPC intervals (third and fourth row in Figure 1).
We use '+' and '-' to indicate the direction (ascending or descending), describe the generic intervals (Clough and Myerson, 1985; Harasim et al., 2016) with Arabic numerals (1: unison, 2: second, 3: third, 4: fourth, 5: fifth etc.), and denote the quality of the intervals (diminished, minor, perfect, major, and augmented) with 'd', 'm', 'P', 'M', 'A', respectively. Note that in this representation, we are able to distinguish enharmonically equivalent intervals, such as +A4, the ascending augmented fourth (tritone), and +d5, the ascending diminished fifth. This is a musically and cognitively relevant distinction: since the interval +d5 commonly resolves inwards into a third (+M3 or +m3) and +A4 commonly resolves outwards into a sixth (+m6 or +M6), these intervals create different harmonic expectations that imply different cognitive representations. As discussed in Section 2.1, a major drawback of the commonly used MIDI format is that it cannot be used to represent these musically important distinctions.
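To make the difference between spelled intervals and their MIDI counterparts concrete, the following minimal sketch encodes tonal pitch classes as integers on the line of fifths, assuming C = 0 as the reference tone. The function names and the ASCII spelling convention ('#' for sharps, 'b' for flats) are hypothetical choices made for illustration and are not taken from the original implementation.

```python
# Sketch: tonal pitch classes (TPCs) as integers on the line of fifths.
# Assumption: C is the reference tone (index 0); names are illustrative.

NATURALS_ON_FIFTHS = {"F": -1, "C": 0, "G": 1, "D": 2, "A": 3, "E": 4, "B": 5}
NATURALS_IN_SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def tpc_to_fifths(name: str) -> int:
    """Map a spelled pitch class such as 'F#' or 'Gb' to the line of fifths."""
    letter, accidentals = name[0], name[1:]
    # Each sharp moves seven fifths up, each flat seven fifths down.
    return NATURALS_ON_FIFTHS[letter] + 7 * (
        accidentals.count("#") - accidentals.count("b")
    )


def tpc_to_midi_class(name: str) -> int:
    """Map a spelled pitch class to its 12TET pitch class (spelling is lost)."""
    letter, accidentals = name[0], name[1:]
    shift = accidentals.count("#") - accidentals.count("b")
    return (NATURALS_IN_SEMITONES[letter] + shift) % 12


def interval_in_fifths(source: str, target: str) -> int:
    """Directed TPC interval from source to target, measured in fifths."""
    return tpc_to_fifths(target) - tpc_to_fifths(source)


# +A4 (C to F#) and +d5 (C to Gb) differ on the line of fifths ...
augmented_fourth = interval_in_fifths("C", "F#")
diminished_fifth = interval_in_fifths("C", "Gb")
# ... but collapse onto the same pitch class in a MIDI-style encoding.
same_in_midi = tpc_to_midi_class("F#") == tpc_to_midi_class("Gb")
```

In this encoding the two intervals lie six fifths above and six fifths below the starting tone, respectively, whereas the 12TET encoding maps both target tones to the same integer.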
Our goal in this paper is to provide a statistical model for music that describes the tonal pitch-class distributions of musical pieces in a compact, structured and interpretable way. In the following, the term pitch class always refers to TPCs and the term interval refers to directed TPC intervals.

The Tonnetz
While the line of fifths connects all possible pitch classes and is of central importance in tonal music (Gárdonyi and Nordhoff, 2002; Temperley, 2000; Weber, 1851), there are other ways of relating the same two tones via one or more interval steps. Of particular importance for harmonic relations in tonal music are, in addition to the perfect fifth (±P5), the major third (±M3) and the minor third (±m3), both ascending and descending (Aldwell et al., 2010; Haas, 2004). We call these the primary intervals and for our evaluation of the TDM we will restrict the set of intervals I to the primary intervals. Figure 2 shows these primary interval relations with respect to the central tone C.
The infinite expansion of this graph is called the Tonnetz (Bigo and Andreatta, 2016; Cohn, 1997, 2012; Harasim et al., 2019) and has a number of historic precursors (e.g., Euler, 1739; Hostinský, 1879; von Oettingen, 1866; Riemann, 1896). Using the representation as a hexagonal graph (Douthett and Steinbach, 1998), PCDs can be plotted as color-coded heatmaps (Moss, 2019). Figure 3 shows the distributions of three pieces from three different musical epochs (by Bach, Beethoven, and Liszt), which we also use for our evaluation below.
In the case of tonal pitch classes, the Tonnetz is topologically equivalent to the Spiral Array (Chew, 2000). In the two-dimensional visualizations in Figure 3, where this structure is 'unrolled', the same tonal pitch class may appear multiple times and is then colored identically.
The distributions in Figure 3 show that the different pieces exhibit characteristic structures along the different axes of the Tonnetz. Bach's piece is largely distributed along the line of fifths (horizontally) with its tonic C occurring most frequently. Beethoven's Sonata movement is also distributed along the line of fifths but it exhibits a considerably wider spread that is roughly symmetric about the most frequent tone A (the tonic of the dominant key). Furthermore, the pitch classes of the tonic D-minor and the dominant A-major triads occupy most probability mass, resulting in the major third axis (F-A-C♯) being more pronounced than in Bach's piece. Finally, the pitch classes in Liszt's piece are most widely distributed, covering a range from F♭ to D♯♯ (24 fifths). In contrast to the other two pieces, the particularly wide distribution of pitch classes in this piece reflects the fact that it juxtaposes harmonically distant keys (in particular through mediantic relations), such as F♯, D, and B♭. These key relations are not reflected in the line-of-fifths topology but are explicitly modeled by the major and minor third axes of the Tonnetz, which suggests the Tonnetz to be a more suitable representation for this kind of extended tonality. We will discuss the tonal structures of these three pieces in more detail in Section 5.3.

Moving on the Tonnetz
A central idea of the TDM is that the generation of tones is associated with certain motions on the Tonnetz. Specifically, the Tonnetz captures harmonic relations, so that moving on the Tonnetz corresponds to typical harmonic changes occurring in a piece. These changes happen on several time scales, from large-scale modulations between different sections of a piece, to harmonic progressions, and polyphony on the surface level. Motion on the Tonnetz thus reflects the deep structural dynamics of a piece. Different paths through the Tonnetz express different harmonic interpretations of the involved tones. This distinction of harmonic functions expressed in different paths on the Tonnetz is an explicit part of our model.
For instance, the tone E can be reached from the tone C by ascending four perfect fifths or by ascending a major third. In the first case, E is conceived to be the perfect fifth of A, which is the perfect fifth of D, which is the perfect fifth of G, which, ultimately, is the perfect fifth of C, reflecting a recursive application of applied dominants, while in the second case, the tone E stands in a direct major third relation to C. We take this distinction as expressing two different interpretations of the harmonic function of E relative to C. That is, we assume that it corresponds to a difference on the cognitive level, because it relates the tones C and E in two different ways.
In some contexts, this might be reflected in different tunings and ways of hearing, as ascending four perfect fifths leads to the Pythagorean major third E, while the E reached by ascending a major third corresponds to the major third E in just intonation. However, since these different paths from C to E are indistinguishable in a notated score, our cognitive model does not require them to be physically distinct.
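The two paths from C to E and their different tunings can be checked with a short worked example. The sketch below assumes the line-of-fifths encoding with C = 0, expresses the six primary intervals as signed numbers of fifths, and uses the standard pure ratios 3/2 (perfect fifth) and 5/4 (major third); the dictionary and function names are hypothetical.

```python
from fractions import Fraction

# Primary intervals as steps on the line of fifths (assumption: C = 0).
# An ascending minor third (C to Eb) lies three fifths below the start,
# a descending minor third (C to A) three fifths above it.
PRIMARY_STEPS = {"+P5": 1, "-P5": -1, "+M3": 4, "-M3": -4, "+m3": -3, "-m3": 3}


def endpoint(start: int, path: list[str]) -> int:
    """Tone reached from `start` after taking all interval steps in `path`."""
    return start + sum(PRIMARY_STEPS[step] for step in path)


def fold_into_octave(ratio: Fraction) -> Fraction:
    """Reduce a frequency ratio to the range [1, 2) by octave shifts."""
    while ratio >= 2:
        ratio /= 2
    while ratio < 1:
        ratio *= 2
    return ratio


C = 0
via_fifths = endpoint(C, ["+P5"] * 4)  # C -> G -> D -> A -> E
via_third = endpoint(C, ["+M3"])       # C -> E directly

# Both paths end on the same tonal pitch class, but imply different tunings.
pythagorean_third = fold_into_octave(Fraction(3, 2) ** 4)
just_third = Fraction(5, 4)
syntonic_comma = pythagorean_third / just_third
```

Both paths arrive at the same line-of-fifths position, while the implied frequency ratios (81/64 versus 5/4) differ by the syntonic comma 81/80. This is the physical difference that the model treats as optional, since the score does not encode it.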
In the TDM, we assume that different paths have different probabilities of occurrence, that is, that some paths are more likely than others. For instance, we assume that different steps occur with different probabilities and that paths cannot have an arbitrary length. The step probabilities and the path-length distribution are explicit components of the TDM (Section 4). Altogether, these aspects reflect different ways of hearing tonal relations, as well as different stylistic characteristics of tonal music in general and of concrete pieces more specifically.

The Tonal Origin
Tonal music is fundamentally characterized by the existence of one or more tonal centers that are related to each other in a hierarchical manner, for instance, by modulating through different local keys (Schenker, 1935; Schoenberg, 1969; Lerdahl and Jackendoff, 1983; Rohrmeier, 2011; Koelsch et al., 2013; Rohrmeier, 2020). In the TDM, we therefore introduce the concept of the tonal origin and assume that a unique tonal origin exists for each piece (see Section 4.2). Intuitively, the tonal origin corresponds to the single pitch class that is best suited to explain all tones in the piece by starting at this point on the Tonnetz and taking a small number of primary interval steps. There are two important caveats to keep in mind for interpreting the tonal origin: 1) Assuming a single tonal origin does not mean that the piece cannot modulate between different keys having their own respective local centers. 2) The tonal origin in our model is not necessarily equivalent to the tonic of the global key. Please see Appendix C for a more detailed discussion of these two points.

The Tonal Diffusion Model
The purpose of the TDM is to model the intuitions presented in Section 3 formally and derive quantitative measures that capture the diffusion of probability mass from the tonal origin along the different axes of the Tonnetz. To this end, we define an explicit generative model, in which each tone in a piece is generated by starting at the tonal origin and taking a number of steps on the Tonnetz. As described in Section 3.3, there are many possible paths connecting two tones on the Tonnetz, which correspond to different cognitive interpretations of their harmonic relation. In our model, these different derivations are treated as a latent representation, which is marginalized out to determine the overall probability of reaching a particular tone.
In the general formulation of our model, we include the possibility of multiple tonal origins, and allow for an arbitrary set of intervals, as well as a generic path-length distribution. For the evaluation of the model on tonal music, presented in Section 5, we then make the more specific assumptions motivated in Section 3. Specifically, we assume a single tonal origin (Section 3.4) and restrict the allowed interval steps to the set of primary intervals present in the Tonnetz (Section 3.2).
The TDM has the basic structure of a topic model with two nested levels of generation, as shown in Figure 4a. On the inner level, each piece (i.e. each document) d is represented as a 'bag of tones' and all tones t ∈ d are generated independently, conditional on the piece-level variables β. On the outer level, the corpus D is represented as a 'bag of pieces' and all pieces d ∈ D are generated independently, conditional on the corpus-level parameters α. The tones t of a piece are the observed variables while γ represents any latent variables involved in the generation of a single tone t.
We extend this basic structure by splitting the corpus-level and piece-level variables into multiple distinct variables with clear semantics and by replacing the inner generative step for a single tone with a model of the underlying latent cognitive process. The complete model is shown in Figure 4b and will now be explained in detail.

Generative Process
Let T ≡ ℤ be the space of all possible TPCs and I = {t-t′|t,t′∈ T } ≡ ℤ be the space of all TPC intervals (see also Figure 1). We assume the following generative process, which corresponds to the graphical model in Figure 4b.
For each piece d in a corpus D draw
\[ c \sim \mathrm{DP}(H_c, \alpha_c) \qquad (1) \]
\[ w \sim \mathrm{DP}(H_w, \alpha_w) \qquad (2) \]
that is, the distribution of tonal origins c is drawn from a Dirichlet process with base distribution H_c over the tonal space T and concentration parameter α_c; and the interval weights w are drawn from a Dirichlet process with base distribution H_w over the interval space I and concentration parameter α_w. Furthermore, the path-length distribution λ is drawn from a prior with hyperparameters α_λ,
\[ \lambda \sim p(\lambda; \alpha_\lambda) \qquad (3) \]
where the prior depends on the specific path-length distribution being used: for a Poisson distribution the conjugate prior is a gamma distribution, for a binomial distribution it is a beta distribution.
For each tone t in a piece d draw
\[ n \sim p(n \mid \lambda) \]
\[ \tau_0 \sim \mathrm{Categorical}(c) \]
\[ \tau_{i+1} - \tau_i \sim \mathrm{Categorical}(w), \qquad i = 0, \dots, n-1 \]
\[ t = \tau_n \]
that is, a number of steps n is drawn from the path-length distribution (Poisson or binomial); a series of latent tones τ_0, …, τ_n is generated by first drawing τ_0 from the distribution of tonal origins and then transitioning n times from τ_i to τ_{i+1} by adding an interval; finally, the last tone τ_n of the sequence is observed as t.
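The inner generative step for a single tone can be sketched as a small sampler. The sketch below makes several simplifying assumptions that go beyond the general formulation: a single fixed tonal origin, the six primary intervals encoded as steps on the line of fifths with C = 0, and a binomial path-length distribution with parameters `n_max` and `p`. All names and the example weights are hypothetical and are not inferred values from any piece.

```python
import random
from collections import Counter

# Primary intervals as steps on the line of fifths (assumption: C = 0).
PRIMARY_STEPS = {"+P5": 1, "-P5": -1, "+M3": 4, "-M3": -4, "+m3": -3, "-m3": 3}


def sample_tone(rng: random.Random, origin: int, weights: dict[str, float],
                n_max: int, p: float) -> int:
    """Sample one tone: draw a path length, then walk from the tonal origin."""
    # Binomial path length as a sum of n_max Bernoulli(p) draws.
    n_steps = sum(rng.random() < p for _ in range(n_max))
    names = list(weights)
    probabilities = [weights[name] for name in names]
    tone = origin  # latent tone tau_0
    for _ in range(n_steps):
        step = rng.choices(names, weights=probabilities)[0]
        tone += PRIMARY_STEPS[step]  # transition tau_i -> tau_{i+1}
    return tone  # the last latent tone tau_n is observed as t


# Illustrative usage with made-up interval weights (they sum to one).
example_weights = {"+P5": 0.35, "-P5": 0.35, "+M3": 0.1,
                   "-M3": 0.05, "+m3": 0.05, "-m3": 0.1}
rng = random.Random(0)
tones = [sample_tone(rng, origin=0, weights=example_weights, n_max=6, p=0.4)
         for _ in range(1000)]
sampled_counts = Counter(tones)  # a sampled 'bag of tones' for one piece
```

Normalizing `sampled_counts` would give a Monte Carlo approximation of the model's pitch-class distribution; the exact distribution is obtained by marginalizing over all paths rather than by sampling.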

Inference
For a given corpus D of musical pieces and corpus-level parameters α, the posterior over the piece-level variables β ≡ (c, w, λ) of a piece d is
\[ p(\beta \mid d; \alpha) \propto p(\beta; \alpha) \prod_{t \in d} p(t \mid \beta), \]
where, following Bayes' theorem, p(β;α) is the prior over piece-level variables and p(t|β) ≡ p(t|c,w,λ) is the marginal likelihood for a single tone t in the piece d with the latent variables γ ≡ (τ_0, …, τ_n, n) being marginalized out. Since in our model all tones within a piece are drawn independently and identically distributed, the distribution p(t|c,w,λ) needs to be computed only once per piece.
The expansion of the marginal likelihood p(t|c,w,λ) is shown in Equation (10),
\[ p(t \mid c, w, \lambda) = \sum_{n=0}^{\infty} p(n \mid \lambda) \sum_{\tau_{n-1}} p(\tau_n \mid \tau_{n-1}, w) \cdots \sum_{\tau_0} p(\tau_1 \mid \tau_0, w)\, p(\tau_0 \mid c), \qquad \tau_n = t, \qquad (10) \]
where p(n|λ) is the distribution over path lengths, p(τ_i|τ_{i-1},w) are the transition probabilities in the latent cognitive process, and p(τ_0|c) is the initial distribution corresponding to possible tonal origins in the piece. To compute p(t|c,w,λ), we have to explicitly marginalize out the latent variables γ ≡ (τ_0, …, τ_n, n). The marginal likelihood p(t|c,w,λ) has a recursive structure, which becomes apparent in Equation (10) by highlighting the intermediate terms p(τ_i|c,w), where the subset of latent variables τ_0, …, τ_{i-1} has already been marginalized out: the latent cognitive process is a Markov chain in the tonal space T and the marginal distribution p(τ_i|c,w) at step i of the process can be recursively computed via dynamic programming. The path length n can be marginalized out by interleaving the dynamic programming updates with weighted increments to the final distribution p(t|c,w,λ), resulting in the procedure shown in Algorithm 1. Being able to compute the marginal likelihood p(t|c,w,λ) allows us to optimize the piece-level variables c, w and λ to find their MAP estimate.
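The dynamic-programming procedure described above can be sketched in a few lines. This is an illustrative reading of the description, not the original implementation (which was written in PyTorch): it represents distributions as dictionaries over line-of-fifths integers, takes the path-length distribution as a callable, and stops once the accumulated path-length probability is within a tolerance of 1. The function names and the `max_steps` safeguard are assumptions.

```python
import math
from typing import Callable


def poisson_pmf(n: int, rate: float) -> float:
    """Probability of path length n under a Poisson distribution."""
    return math.exp(-rate) * rate ** n / math.factorial(n)


def marginal_tone_distribution(
    origin_dist: dict[int, float],
    interval_weights: dict[int, float],
    path_length_pmf: Callable[[int], float],
    tol: float = 1e-5,
    max_steps: int = 1000,
) -> dict[int, float]:
    """Approximate p(t | c, w, lambda) by marginalizing paths and path lengths.

    origin_dist:      p(tau_0 | c), tones as integers on the line of fifths
    interval_weights: step probabilities w, steps as signed numbers of fifths
    path_length_pmf:  function n -> p(n | lambda)
    """
    current = dict(origin_dist)      # p(tau_i | c, w) for i = 0
    result: dict[int, float] = {}    # running estimate of p(t | c, w, lambda)
    cumulative = 0.0                 # accumulated path-length probability
    n = 0
    while cumulative < 1.0 - tol and n <= max_steps:
        p_n = path_length_pmf(n)
        # Weighted increment: paths that stop after exactly n steps.
        for tone, prob in current.items():
            result[tone] = result.get(tone, 0.0) + p_n * prob
        cumulative += p_n
        # Dynamic-programming update: one more Markov transition.
        following: dict[int, float] = {}
        for tone, prob in current.items():
            for step, weight in interval_weights.items():
                target = tone + step
                following[target] = following.get(target, 0.0) + prob * weight
        current = following
        n += 1
    return result


# Illustrative usage: origin C (= 0), fifth and major-third steps only.
example = marginal_tone_distribution(
    origin_dist={0: 1.0},
    interval_weights={1: 0.5, -1: 0.3, 4: 0.2},
    path_length_pmf=lambda n: poisson_pmf(n, 2.0),
)
```

Because the loop is truncated, the returned values sum to slightly less than 1 (by at most the tolerance); renormalizing is optional and depends on how the result is used.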
For our evaluation, we make three additional assumptions that are specific to tonal music and well-established in music theory (see Section 3): 1) we assume that only a single tonal origin per piece exists and that all tones are a priori equally likely to become the tonal origin; 2) we restrict the number of allowed interval steps to the six primary intervals present in the Tonnetz; 3) we assume a uniform prior over the path-length variable λ. These assumptions facilitate inference (see Appendix D for technical details) and imply that computing a MAP estimate becomes equivalent to a maximum likelihood (ML) estimate.
We infer values for the piece-level variables β by maximizing their posterior probability or (equivalently, due to using uniform priors) minimizing the negative data log-likelihood, cross-entropy, or Kullback-Leibler divergence (KLD). The model was implemented in PyTorch (Paszke et al., 2019) and optimization was done in two nested procedures: 1) The values for w and λ were optimized via gradient descent using the Adam optimizer (Kingma and Ba, 2015). 2) For specific values of the weight and path-length variables, w and λ, we chose the tonal origin c that minimizes the KLD (gradients are propagated through this step using PyTorch's automatic differentiation functionality).
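The stated equivalence of the three objectives can be verified numerically. The sketch below uses a made-up bag of tones and a made-up model distribution (both hypothetical) and checks that the mean negative log-likelihood equals the cross-entropy, which in turn differs from the KLD only by the entropy of the empirical PCD. Since that entropy does not depend on the model, all three objectives share the same minimizer.

```python
import math
from collections import Counter


def empirical_pcd(tones: list[int]) -> dict[int, float]:
    """Relative frequencies of the tones in a piece."""
    counts = Counter(tones)
    total = sum(counts.values())
    return {tone: count / total for tone, count in counts.items()}


def entropy(p: dict[int, float]) -> float:
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def cross_entropy(p: dict[int, float], q: dict[int, float]) -> float:
    return -sum(v * math.log(q[t]) for t, v in p.items() if v > 0)


def kld(p: dict[int, float], q: dict[int, float]) -> float:
    """Kullback-Leibler divergence from the empirical p to the model q."""
    return sum(v * math.log(v / q[t]) for t, v in p.items() if v > 0)


def mean_negative_log_likelihood(tones: list[int], q: dict[int, float]) -> float:
    return -sum(math.log(q[t]) for t in tones) / len(tones)


# Hypothetical data: tones on the line of fifths and a model distribution.
piece = [0, 0, 1, 2, 1, 0, -1, 4, 0, 1]
model = {-1: 0.1, 0: 0.4, 1: 0.25, 2: 0.15, 4: 0.1}

p_hat = empirical_pcd(piece)
nll = mean_negative_log_likelihood(piece, model)
ce = cross_entropy(p_hat, model)
divergence = kld(p_hat, model)
constant_offset = entropy(p_hat)  # independent of the model
```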

Results and Discussion
We evaluate the TDM in two ways: first, we introduce several baseline models (Section 5.1) and perform a quantitative comparison (Section 5.2) on a corpus of 248 pieces from different historical epochs. Second, we perform a detailed qualitative analysis (Section 5.3) on three exemplary pieces, inspecting the inferred parameters and discussing our musical interpretation of these results.

Baseline Models
We introduce several baseline models to verify whether the structural assumptions incorporated in our model effectively improve performance. In particular, we are interested in validating the impact of the Tonnetz topology and the assumed latent process on model performance.
The two static baseline models, Static (1 Profile) and Static (2 Profiles), do not incorporate any topological information except for transposition invariance. These models also lack the semantic interpretability aimed for with the TDM and most closely resemble the categorical representation of PCDs, as used in template-based key finders or the model of Hu and Saul (2009b). The line-of-fifths model, TDM (Binomial, 1D), is a reduced version of the TDM that uses the line of fifths but not the full Tonnetz topology. To evaluate the influence of the path-length distribution we also include TDM (Poisson), a version of the full TDM with a Poisson path-length distribution, in addition to the best-performing TDM (Binomial) with a binomial path-length distribution.

Static Model (1 Profile)
Algorithm 1: Computing the marginal likelihood p(t|c,w,λ) by explicitly marginalizing out the latent variables γ ≡ (τ_0, …, τ_n, n). In practice, the infinite loop over n is terminated by summing over p(n|λ) and stopping when this cumulative probability approaches 1 (with tolerance 10^-5).

The static baseline model consists of a single global PCD, that is, a fixed template or profile. The model has an individual parameter for each tonal pitch class; together these parameters determine the shape of the profile and they are trained via gradient descent on the entire corpus. The profile is individually matched to each piece (during training and for evaluation) by shifting it along the line of fifths. The model thus has a large number of continuous corpus-level parameters (one for each tonal pitch class) but only a single discrete piece-specific variable (the transposition). The optimal profile corresponds to the purple line in Figures 6a-6c.
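Matching a static profile to a piece by transposition along the line of fifths can be sketched as a search over shifts that minimizes the KLD to the empirical PCD. The profile values, the candidate range of shifts, and the function names below are hypothetical; the sketch also assumes that a zero model probability on an observed tone yields an infinite divergence.

```python
import math


def kld(empirical: dict[int, float], model: dict[int, float]) -> float:
    """KLD from the empirical PCD to the model; infinite if support is missing."""
    total = 0.0
    for tone, p in empirical.items():
        if p == 0:
            continue
        q = model.get(tone, 0.0)
        if q == 0:
            return math.inf
        total += p * math.log(p / q)
    return total


def transpose(profile: dict[int, float], shift: int) -> dict[int, float]:
    """Shift a profile along the line of fifths by `shift` fifths."""
    return {tone + shift: prob for tone, prob in profile.items()}


def match_profile(empirical: dict[int, float], profile: dict[int, float],
                  shifts: range) -> tuple[float, int]:
    """Return (smallest KLD, best transposition) over the candidate shifts."""
    return min((kld(empirical, transpose(profile, s)), s) for s in shifts)


# Hypothetical profile over five tonal pitch classes (F, C, G, D, A for C = 0).
profile = {-1: 0.1, 0: 0.4, 1: 0.25, 2: 0.15, 3: 0.1}
# A piece whose PCD is the same shape, two fifths higher.
piece_pcd = transpose(profile, 2)

best_kld, best_shift = match_profile(piece_pcd, profile, range(-7, 8))
```

The two-profile variant would repeat this search for each profile and keep the pair of profile class and transposition with the smallest divergence.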

Static Model (2 Profiles)
This model is identical to the static model described above but comprises two different static PCDs. Matching with a specific piece is done by choosing the best profile and transposition. It thus has two discrete piece-specific variables, the transposition (as above) and the binary profile class. This model is conceptually very similar to conventional template-based key-finding algorithms with two profiles, which are usually assumed to correspond to the major and minor modes, a questionable assumption that will be discussed below. The optimal profiles correspond to the blue line in Figure 6a (first profile) and Figures 6b and 6c (second profile).

Line-of-Fifths TDM (Binomial, 1D)
To specifically test the impact of the Tonnetz topology, we introduce a reduced version of the TDM that uses only the line-of-fifths topology. This model is identical to the full TDM (with binomial path-length distribution), except that it only uses fifth steps (no major or minor thirds). The model has three piece-specific variables (one for weighting +P5 against -P5 and two for the binomial distribution) and produces bell-shaped PCDs on the line of fifths, including Gaussian and skewed bell shapes.

Corpus Evaluation
For the quantitative evaluation of the TDM, we used a corpus of 248 pieces that are representative of the Baroque, Classical, and Romantic era: all preludes and fugues from Bach's Well-Tempered Clavier Vol. I & II (96 pieces), all movements from Beethoven's piano sonatas (101 pieces), and 51 pieces by Liszt, including etudes, pieces from the Années de Pèlerinage, and the B-minor Sonata.
All models were trained on the entire corpus to minimize the Kullback-Leibler divergence (KLD) between the model and empirical PCD (or, equivalently, minimize the cross-entropy, or maximize the data likelihood). To train the two static models with corpus-level parameters, the three composers were weighted equally: each piece was weighted inversely proportionally to the total number of pieces by the respective composer. The results are shown in Figure 5.
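The composer weighting can be illustrated with the piece counts given above. The sketch assumes that the weights are additionally normalized to sum to one over the corpus, which is a convenience chosen here and not stated in the text; the function name is hypothetical.

```python
def piece_weights(pieces_per_composer: dict[str, int]) -> dict[str, float]:
    """Per-piece weight for each composer, inversely proportional to the
    composer's number of pieces (normalized so that all weights sum to 1)."""
    n_composers = len(pieces_per_composer)
    return {
        composer: 1.0 / (n_composers * n_pieces)
        for composer, n_pieces in pieces_per_composer.items()
    }


corpus = {"Bach": 96, "Beethoven": 101, "Liszt": 51}
weights = piece_weights(corpus)

# Total weight contributed by each composer (equal by construction).
composer_totals = {name: weights[name] * corpus[name] for name in corpus}
```

With this weighting, a single piece by Liszt counts almost twice as much as a single piece by Bach or Beethoven, so that no composer dominates the learned profiles merely by having more pieces in the corpus.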
As expected, the static model with a single global PCD performs worst. The inferred profile can be seen in the detailed analysis of the single pieces in Figure 6. It is bimodal and can be understood as a combined major-minor profile with additional long tails along the line of fifths. For pieces in major mode, the lower mode coincides with the tonic and dominant pitch class, while the upper mode covers the major third. For pieces in minor mode the reverse is true, with the upper mode corresponding to the tonic and dominant, while the lower mode covers the minor third. This single profile can be interpreted as the best attempt to match all pieces in the corpus to a single profile.
Also not surprisingly, the static model with two profiles performs considerably better. Notably, for the Bach pieces this model performs as well as the TDM (Poisson) and almost as well as the TDM (Binomial). Presumably, the main reason for the strong performance of the static model on Bach's pieces is the fact that these pieces are typically confined to a relatively well-defined range on the line of fifths, after which the PCD quickly drops to zero. This shape does not correspond to the smooth decay assumed in the TDM, which is more typically observed in the pieces by Beethoven. In contrast to the relatively good performance for Bach's pieces, for Beethoven and Liszt the static two-profile model performs even worse than the reduced line-of-fifths TDM (Binomial, 1D).
Inspecting the two inferred profiles in Figure 6 reveals an interesting property: instead of learning a major and a minor profile (as one might have expected), one of the profiles is almost identical to the compound major-minor profile of the single-profile model (just with shorter tails), while the second profile has three modes that form an augmented triad. The numbers of pieces associated with the respective profiles are 96/0 (Bach), 88/13 (Beethoven), and 8/43 (Liszt). This suggests that reducing the tonality of the Classical and Romantic era to only a major and a minor mode might not always be appropriate. The profiles could rather be interpreted as a 'conventional diatonic' profile and an 'extended tonality' profile. In fact, when trained on only the Bach pieces (results not included), the static two-profile model learns a major and minor profile, outperforms the TDM, and displays a key classification accuracy of 97%. Our results thus question established key classifiers by showing that their accuracy is very high only in a relatively narrow musical repertoire. In conclusion, the static models strongly overfit on the corpus being used, which may or may not be acceptable, depending on the application. In contrast, note that the TDM has more piece-specific degrees of freedom than the static model but no corpus-level parameters; it is thus immune to this kind of overfitting on the corpus level.
Finally, we compare the performance of the different versions of the TDM. For the line-of-fifths based TDM (Binomial, 1D), we observe a decreasing performance from Bach to Beethoven and Liszt. This corresponds to the music-theoretic insight that Bach's harmony is predominantly fifths-based, while Beethoven incorporates an increasing amount of third-based harmonic progressions, which is yet again extended by Liszt. Beethoven's pieces therefore tend to be multimodal on the line of fifths, but not on the Tonnetz. In contrast, some of Liszt's pieces are multimodal even on the Tonnetz and tend to be fragmented on the line of fifths. This is also strongly reflected in the different performances of the full TDMs, which both show the best results for Beethoven. However, the reasons for the decrease in performance for Bach and Liszt as compared to Beethoven are presumably different.
As mentioned above, Bach's pieces tend to have PCDs with a relatively sharp decay, which conflicts with the smooth decay of the TDM, more commonly found in Beethoven's pieces. On the other hand, the fact that (due to the extended tonality) Liszt's pieces tend to be multimodal even on the Tonnetz means that, even though they have a smooth decay, they may not be well captured by the TDM. A slight modification of the TDM to allow for multiple tonal origins might make it possible to model this kind of extended tonality appropriately.

Case Studies
We chose three exemplary pieces to evaluate whether the TDM is able to capture characteristics of the respective piece and historical period. We used the same three pieces for which the empirical PCDs are shown in Figure 3, that is, Bach's C major Prelude, a movement from Beethoven's 'Tempest' Sonata, and Liszt's Bénédiction de Dieu dans la Solitude. As described above, the piece-level variables were determined via MAP/ML estimation. The results are shown in Figure 6 with the values in brackets indicating the KLD between the empirical and model distributions.
For each of the three example pieces, the MAP estimates for the parameters of the binomial TDM are shown in Figure 7. The parameters µ and σ, given in the caption, are the mean and standard deviation of the path-length distribution. The probabilities $p_i$ for choosing the different intervals in a particular step are indicated in the box at the bottom right. The tonal origin is displayed in the center of each plot and the lengths of the arrows (also given in their labels) indicate the expected numbers of steps $\mu p_i$ in the six primary interval directions.
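The arrow lengths $\mu p_i$ follow from a short expectation argument, sketched below. It assumes that each step chooses interval $i$ with probability $p_i$ independently of the path length $n$; the symbols $n_i$, $N$ and $q$ are introduced here for illustration and are not taken from the paper.

```latex
% n: path length with mean \mu; n_i: number of steps taken in direction i
% Assumption: each step picks interval i with probability p_i, independently of n
\mathbb{E}[n_i]
  = \sum_{n} p(n)\, \mathbb{E}[n_i \mid n]
  = \sum_{n} p(n)\, n\, p_i
  = \mu\, p_i,
\qquad
\sum_i \mathbb{E}[n_i] = \mu .
% For a binomial path length with N trials and success probability q:
\mu = N q, \qquad \sigma = \sqrt{N q\,(1 - q)} .
```

Under these assumptions the six arrow lengths in each panel add up to the mean path length µ.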

Bach: C major Prelude
For Bach's prelude (Figures 3a, 6a and 7a), the model was able to capture the PCD mainly extending along the line of fifths as reflected in the high weights for both perfect fifth intervals +P5 and -P5. The stronger weight for the descending fifth ($p_{-P5} = 0.475$) reflects that the tonal origin G lies a fifth above the most frequent tone and global tonic C. The strong descending fifth is balanced by the combination of the ascending fifth ($p_{+P5} = 0.288$) and the descending minor third that also ascends along the line of fifths ($p_{-m3} = 0.137$). Interestingly, explaining the PCD using the global tonic C as the tonal origin (results not shown) is only marginally less accurate with correspondingly changed weights (stronger ascending fifth and a strong ascending major third instead of the descending minor third). Given the lack of temporal information, this ambiguity is musically consistent and reflects the harmonically close relation of G and C (also see Section 3.4). The general importance of the line of fifths for the organization of the tonal material is characteristic of Baroque pieces.

Beethoven: 'Tempest' Sonata
The optimal parameters for Beethoven's Sonata movement (Figures 3b, 6b and 7b) also prioritize the two perfect fifth directions, almost in a symmetrical manner ($p_{+P5} = 0.331$, $p_{-P5} = 0.356$). However, as opposed to Bach's piece, a significant amount of the probability mass (≈30%) is assigned to the two major third components ($p_{+M3} = 0.160$, $p_{-M3} = 0.137$), again essentially symmetric. This is typical for pieces in the minor mode due to the above-mentioned overall prominence of the (minor) tonic and (major) dominant triads. However, it can also be interpreted as reflecting the stylistic changes in the Classical period that allow for broader ranges of mediantic (i.e. third-based) local key relations, as can also be observed in this movement. The approximate point symmetry of the empirical distribution around the pitch class A and along the three axes of the Tonnetz (see Figure 3b) is reflected in similar weights of the ascending and descending intervals along the same axis, as shown in Figure 6b. Note that A, the point of symmetry and the model's choice for the tonal origin, is in fact the fifth of the tonic D and the root of the dominant triad. Again, this exemplifies that the tonal origin chosen by the TDM does not need to be the tonic but rather incorporates a more general notion of intervallic relations.

Liszt, Bénédiction de Dieu dans la Solitude
The most diverse distribution of pitch classes is found in Liszt's piano piece (Figures 3c, 6c and 7c). The empirical distribution in Figure 6c is clearly multimodal on the line of fifths around F♯/C♯, D♯/A♯, and D/A with another smaller mode around F/B♭, likewise pointing towards mediantic relations as in Beethoven's piece. But unlike the Sonata, the distribution is asymmetric. The extended harmonic relations in Liszt's piece are reflected in the model outcome, which assigns non-zero probability mass to all six primary intervals, with an acceptable overall fit of the estimated distribution (KLD=0.061) as compared to the two previous pieces and the range of KLD values in Figure 5.
The model was able to capture a number of important characteristics of this piece. The harmonic structure of the entire piece is fundamentally governed by major third relations. Its three sections (Moderato; Andante; and Più sostenuto, quasi Preludio) stand in the keys F♯ major, D major, and B♭ major, respectively. The major-third relation between those keys also permeates each of the sections on more local levels (e.g. D major and B♭ major passages in the F♯ major sections) and thus governs the harmonic structure of the piece on several hierarchical levels.
Since each of the sections is largely diatonic with some ornamental chromaticism, it is not surprising that also for this Romantic piece the perfect fifths together account for more than 50% of the overall weights, followed by the descending minor and major thirds (0.216 and 0.192, respectively). This entails, for example, that the upper major third A♯ of the frequently used F♯ major triad is largely explained by the combination of an ascending fifth and a descending minor third, while D and B♭ are more directly explained as descending major third steps. In particular, the difference of the model explanations between B♭ and A♯ is meaningful since these tonal pitch classes bear different harmonic implications, which would be lost in a neutral pitch-class representation.
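The difference between the explanations of A♯ and B♭ can be made concrete with a little interval arithmetic. The sketch below assumes that tonal pitch classes are indexed on the line of fifths with C = 0, and it takes F♯ as the starting point purely for illustration; the names `LOF_STEP`, `tpc_name` and `walk` are hypothetical and not part of the published code.

```python
# Line-of-fifths displacement of each directed primary interval.
# Assumption for this sketch: tonal pitch classes are line-of-fifths indices with C = 0.
LOF_STEP = {"+P5": 1, "-P5": -1, "+M3": 4, "-M3": -4, "+m3": -3, "-m3": 3}


def tpc_name(index):
    """Spell a line-of-fifths index (C = 0, G = 1, F = -1, ...) as a note name."""
    letters = "FCGDAEB"
    letter = letters[(index + 1) % 7]
    accidentals = (index + 1) // 7  # > 0: sharps, < 0: flats
    sign = "♯" if accidentals > 0 else "♭"
    return letter + sign * abs(accidentals)


def walk(origin, path):
    """Follow a sequence of interval steps from `origin` and return the final index."""
    return origin + sum(LOF_STEP[step] for step in path)


F_SHARP = 6  # F♯ lies six fifths above C
a_sharp = walk(F_SHARP, ["+P5", "-m3"])  # ascending fifth, then descending minor third
d = walk(F_SHARP, ["-M3"])               # one descending major third
b_flat = walk(F_SHARP, ["-M3", "-M3"])   # two descending major thirds

print(tpc_name(a_sharp), tpc_name(d), tpc_name(b_flat))  # A♯ D B♭
print((a_sharp - b_flat) % 12 == 0)                       # True
```

A♯ and B♭ end up twelve fifths apart, so they coincide in a neutral twelve-tone representation while remaining distinct tonal pitch classes reached by different paths.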

Conclusion and Future Work
We presented the Tonal Diffusion Model (TDM), a generative probabilistic model for pitch-class distributions (PCDs) that incorporates relevant music-theoretic and cognitive insights. As opposed to most existing models for PCDs, the TDM incorporates algebraic and geometric structures from music theory and describes the generation of tones in a piece as a latent cognitive process. It thereby relates the statistical properties of PCDs to musically meaningful and interpretable variables.
The model was evaluated quantitatively on a corpus of 248 pieces, showing superior performance to traditional models. Comparing against several baseline models, the positive impact of incorporating the Tonnetz structure was demonstrated. Furthermore, a detailed analysis of three exemplary pieces showed that the TDM is able to capture characteristic properties of these pieces and the respective period.
The TDM is well-suited to study a range of relevant questions in digital musicology and MIR. It may be extended and adapted in multiple ways. For instance, it can be adapted to incorporate more specific assumptions about the underlying cognitive processes and the relevant musical style, and it allows for corpus-based studies to investigate historical developments and stylistic differences among a large number of musical pieces. In MIR, the piece-level variables (tonal origin, interval weights, and path-length distribution) can be conceived as a tonal fingerprint of the piece, going beyond the notion of a pitch profile, and thus allowing its tonality to be determined in a novel way. The model can be extended to include a larger (or infinite) set of intervals I by employing a full Dirichlet process prior, and the modeled cognitive process can be adapted by using independent path-length distributions for the different intervals. The tonal space can be augmented to include tuning differences and take into account different interpretations for different generation paths. Finally, the model may be generalized to include a time component that takes sequential and syntactic dependencies in musical pieces into account.
The TDM thus provides a novel approach to modeling PCDs in a compact and musically interpretable way, while outperforming existing approaches in terms of accuracy. It may thereby serve as a broad foundation for further developing generative models of PCDs in music and opens up multiple highly promising directions for future research.

Note
1 The data and code to reproduce our results can be found at https://github.com/DCMLab/tonaldiffusion-model.

A. Representations of Pitch-Class Distributions
Many large corpora are available in the MIDI format (Huang et al., 2017; Madsen et al., 2007; Rohrmeier and Cross, 2008; White, 2014), possibly inferred from audio (Serrà et al., 2012; Weiß et al., 2018). A major drawback of the MIDI representation is the assumption of twelve equivalence classes, which does not allow for the distinction between enharmonically equivalent pitch classes (Selfridge-Field, 1997). As a consequence, using the MIDI representation implies that many relevant musical relations are obfuscated, such as the difference between enharmonically equivalent notes in dominant seventh and German sixth chords, for instance, (A♭, C, E♭, G♭) versus (A♭, C, E♭, F♯). Algorithms for pitch spelling (Bora et al., 2018; Cambouropoulos, 2003; Chew and Chen, 2005; Meredith, 2006; Stoddard et al., 2004) attempt to infer this missing information from the musical context and can to some degree be employed to mitigate this problem. Recent research has also produced datasets in which the tonal spelling of pitch classes is retained, for instance, in the **kern, MusicXML, or MEI formats, which are increasingly being made available online (Freedman, 2014; Fujinaga et al., 2014; Sapp, 2005).
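The loss of spelling information can be illustrated with the two chords mentioned above. The following sketch maps spelled note names to the twelve MIDI pitch classes; the helper `midi_pitch_class` is a hypothetical name introduced only for this example.

```python
# Semitone offsets of the natural notes above C.
BASE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def midi_pitch_class(name):
    """Map a spelled note name such as 'F♯' or 'G♭' to a pitch class 0..11."""
    letter, accidentals = name[0], name[1:]
    shift = accidentals.count("♯") - accidentals.count("♭")
    return (BASE[letter] + shift) % 12


dominant_seventh = ["A♭", "C", "E♭", "G♭"]
german_sixth = ["A♭", "C", "E♭", "F♯"]

print([midi_pitch_class(n) for n in dominant_seventh])  # [8, 0, 3, 6]
print([midi_pitch_class(n) for n in german_sixth])      # [8, 0, 3, 6]
```

Both chords collapse onto the same four pitch classes, so the distinction between G♭ and F♯ cannot be recovered from this representation alone.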
As the large datasets in MIDI format are widely used in the MIR community, this simplified representation has largely carried over to the representation of PCDs. In musical corpus studies, PCDs are often obtained by simply counting the relative frequencies of notes (Temperley and Marvin, 2008; Albrecht and Huron, 2014), resulting in categorical distributions with one separate and independent parameter for each pitch class. They are visualized in a linear chromatic arrangement. This implies three shortcomings: 1) The visual representation in a linear arrangement suggests some ordering and proximity relation (and thus an implicit dependency between the pitch classes) that is not reflected in the categorical distribution. 2) The commonly used chromatically ascending order does not reflect tonal relations well, especially with respect to harmony and key. An arrangement along the line or circle of fifths would be better suited for these relations. 3) The space of pitch classes exhibits an inherent cyclic topology (whether arranged chromatically or along the circle of fifths), which is not reflected in a linear arrangement.
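The second and third shortcomings can be seen in a small example. The sketch below reorders a twelve-element PCD from chromatic order into circle-of-fifths order; the function name `to_fifths_order` and the toy C major scale vector are assumptions made for illustration.

```python
def to_fifths_order(pcd):
    """Reorder a chromatic PCD (index 0 = C) along the circle of fifths.

    Position k on the circle of fifths holds chromatic pitch class (7 * k) % 12,
    giving the order C, G, D, A, E, B, F♯, C♯, A♭, E♭, B♭, F.
    """
    return [pcd[(7 * k) % 12] for k in range(12)]


# Toy PCD: one count for each tone of the C major scale, in chromatic order.
c_major_scale = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]

print(to_fifths_order(c_major_scale))  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
```

In chromatic order the scale tones are scattered, whereas on the circle of fifths they form one contiguous segment from F to B that wraps around the end of the list, which a linear plot cannot show.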
A more recent approach from mathematical music theory is the application of the Discrete Fourier Transform (DFT) to PCDs, which takes the cyclical nature of pitch-class space into account (e.g. Amiot, 2016; Noll, 2019; Quinn, 2006, 2007; Yust, 2019). The Fourier representation of PCDs makes use of the implicit ordering and topology and has some interpretatory value as different coefficients can be associated with different musical properties (e.g. the third coefficient reflects triadicity and the fifth coefficient diatonicity). While in some cases the Fourier representation may yield a somewhat better interpretability than the categorical representation, it still has two drawbacks: 1) Inherent to the Fourier representation is the assumption of a circular pitch-class space, which implies the same shortcomings in expressing musical relations as for the MIDI representation. 2) The Fourier representation is still inflexible when it comes to incorporating additional music-theoretic and cognitive knowledge, making model improvements impossible.
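The association between coefficients and musical properties can be checked numerically. The following sketch computes DFT magnitudes of two toy pitch-class vectors with NumPy; the function name `dft_magnitudes` and the binary scale and triad vectors are assumptions for illustration, not data from the corpus.

```python
import numpy as np


def dft_magnitudes(pcd):
    """Return the magnitudes of the DFT coefficients of a normalized 12-element PCD."""
    pcd = np.asarray(pcd, dtype=float)
    pcd = pcd / pcd.sum()
    return np.abs(np.fft.fft(pcd))


# Toy pitch-class vectors in chromatic order (index 0 = C).
c_major_scale = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
c_major_triad = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]

scale_mag = dft_magnitudes(c_major_scale)
triad_mag = dft_magnitudes(c_major_triad)

# Coefficients 1..6 carry all the information for a real-valued input.
strongest_scale = int(np.argmax(scale_mag[1:7])) + 1
strongest_triad = int(np.argmax(triad_mag[1:7])) + 1
print(strongest_scale, strongest_triad)  # 5 3
```

For these idealized inputs the diatonic scale peaks at the fifth coefficient and the major triad at the third, in line with the interpretation given above.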

B. Comparison to Hu and Saul
The only music-specific application of topic models we are aware of is by Hu and Saul (2009). However, while sharing the general topic model structure with our TDM, they target a very different goal and there are several profound differences, which make a direct comparison difficult.
The main difference is that Hu and Saul address the problem of key finding under the standard assumption of two distinct modes (major and minor) with 12 possible values for the tonic (all neutral pitch classes in 12TET). This is contrary to the purpose of our model in two ways. First, we aim to provide a more fine-grained representation of tonality (specifically, one that goes beyond the coarse classification into two discrete modes) by using weighted directions on the Tonnetz. Second, we argue that this should be achieved by providing a more structured and interpretable representation of the PCDs (the weights in our model), while Hu and Saul learn one fixed, unstructured distribution for each mode.
Another difference is that Hu and Saul add one additional layer to the standard topic modeling architecture, resulting in three layers (piece, section, note). As a result, they can infer key labels for different sections of each piece, which provides more detailed information. However, this does not necessarily improve accessibility because these separate key labels still require interpretation by a domain expert and do not compactly represent the global harmonic character of the entire piece (besides again adhering to the simple binary classification).
Finally, given their interest in traditional key finding, Hu and Saul compare their model's accuracy to established template-based key finders. In contrast, as our interest lies in capturing the fine-grained structure of PCDs, we use the KLD to the empirical distribution (or equivalently the cross-entropy or data likelihood) as a performance measure in our evaluation. Consequently, a direct comparison to Hu and Saul's results is hardly possible. We do, however, include a baseline model that is similar to their model in that it learns two key profiles for the entire corpus. Our evaluation shows that its performance is acceptable on Baroque pieces but strongly decays for pieces from the Classical and Romantic era. Furthermore, the learned profiles indicate that the pervasive assumption of a piece being either in major or minor mode is a coarse simplification, which is justified only in a narrow stylistic range.
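The equivalence between the KLD and the cross-entropy as performance measures can be verified with a few lines of code. The sketch below is a minimal illustration using natural logarithms and two made-up three-element distributions; it assumes the model assigns non-zero probability wherever the empirical distribution does, and all function names are hypothetical.

```python
import numpy as np


def cross_entropy(empirical, model):
    """Cross-entropy H(p, q) = -sum_t p(t) log q(t), skipping tones with p(t) = 0."""
    empirical, model = np.asarray(empirical, float), np.asarray(model, float)
    mask = empirical > 0
    return float(-np.sum(empirical[mask] * np.log(model[mask])))


def entropy(empirical):
    """Entropy H(p) of the empirical distribution."""
    return cross_entropy(empirical, empirical)


def kl_divergence(empirical, model):
    """Kullback-Leibler divergence D(p || q) from the model q to the empirical p."""
    empirical, model = np.asarray(empirical, float), np.asarray(model, float)
    mask = empirical > 0
    return float(np.sum(empirical[mask] * np.log(empirical[mask] / model[mask])))


p = [0.5, 0.3, 0.2]  # made-up empirical distribution
q = [0.4, 0.4, 0.2]  # made-up model distribution

kld = kl_divergence(p, q)
gap = cross_entropy(p, q) - entropy(p)
print(abs(kld - gap) < 1e-12)  # True
```

Since the entropy of the empirical distribution is fixed for a given piece, ranking models by KLD is the same as ranking them by cross-entropy, and for a piece with a fixed number of tones the cross-entropy is the negative log-likelihood per tone.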

C. The Tonal Origin
There are two important caveats to keep in mind for interpreting the tonal origin. First, assuming a single tonal origin does not mean that the piece cannot modulate between different keys having their own respective local centers. Rather, we assume that one pitch class exists that governs the entire piece and can serve to explain the occurrence of all other pitch classes, among other things by taking possible modulations into account. While local modulations may affect the perception of a tonal center (Cuddy and Thompson, 1992; Farbood, 2016), the assumption of a unique global key is generally reasonable for pieces up to the late Romantic epoch, which tend to have a well-defined and unique global key. In fact, our general model definition (see Section 4) also allows for more than one tonal origin, and applying the TDM to a corresponding corpus of pieces is highly interesting future work.
The second important caveat is that the tonal origin in our model is not necessarily equivalent to the tonic of the global key. While in the hierarchical dependency structure of harmonic relations, the tonic of the global key is the structurally dominating and most stable center, this does not directly translate into the statistical properties of PCDs. The most important confounding factor in this respect is that only the relative frequencies of pitch classes are preserved but not the information on their temporal order. For instance, in a piece (say in the key of C) with a short sub-dominant section (in F major) and a short dominant section (in G major), the global tonic (pitch class C) will generally be the statistically most salient pitch class. However, the PCD of another piece in the key of F major with a short pre-dominant section on the second scale degree (G), and an extended dominant section (C) may look exactly the same. In the second piece, the most salient pitch class will again be C, which here is the tonic of the dominant. This example demonstrates that the global tonic cannot generally be identified based on the PCD alone. Instead, F would be identified as the global tonic of the second piece based on temporal information, such as the piece starting and ending in that key. Some approaches to algorithmic key finding therefore take only the beginnings or endings of pieces into account (e.g. Albrecht and Shanahan, 2013), effectively ignoring most of the tonal material. But precisely this temporal information is not retained in the overall PCDs and we thus cannot expect a model of PCDs to directly relate to these concepts.
The tonal origin in our model therefore has to be conceived as a more general statistical concept. Even though the global tonic is a good candidate in many cases, other pitch classes (especially those that are closely related to the global tonic on the Tonnetz) may be better suited to explain a given PCD.

D. Specific Assumptions for the Evaluation
For our evaluation, we make several assumptions that are specific to tonal music and well-established in music theory (see Section 3).
Our first assumption is that only a single tonal origin per piece exists and that all tones are a priori equally likely to become the tonal origin. As discussed above (see Section 3.4), this does not mean that the piece cannot modulate between different keys. Mathematically, this corresponds to the base distribution $H_c$ being uniform over the tonal space T and the concentration parameter $\alpha_c$ being equal to zero. As a result, optimizing over c corresponds to finding the single best-matching tonal origin for the given piece. This optimization is further facilitated by the fact that transitions are defined in terms of intervals, which means that shifting the tonal origin simply produces a shifted version of the distribution with the same shape and thus does not require recomputing the distribution via dynamic programming.
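The shift property can be spelled out as a short induction. The sketch below assumes that the tonal space T can be treated additively, so that differences such as $\tau - c$ are well defined, and it introduces the notation $p_n(\tau \mid c)$ for the distribution after exactly $n$ steps; this notation is chosen here and is not taken from the paper.

```latex
% p_n(\tau \mid c): probability of being at tone \tau after n steps from origin c
p_0(\tau \mid c) = \delta_{\tau, c}, \qquad
p_{n+1}(\tau' \mid c) = \sum_{\tau \in T} p(\tau' \mid \tau, w)\, p_n(\tau \mid c).
% Since p(\tau' \mid \tau, w) depends only on the interval \tau' - \tau,
% induction on n gives
p_n(\tau \mid c) = p_n(\tau - c \mid 0),
% and therefore, for any path-length distribution p(n \mid \lambda),
p(\tau \mid c, w, \lambda) = \sum_{n} p(n \mid \lambda)\, p_n(\tau - c \mid 0).
```

The distribution therefore needs to be computed only once for a reference origin and can then be shifted to every candidate origin c.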
Our second assumption consists in restricting the number of allowed intervals (i.e. the number of possible transitions in the cognitive process) to a finite set and assuming a uniform prior over their weights w. The Dirichlet process is thus replaced by its finite-dimensional equivalent, a Dirichlet distribution, with all concentration parameters equal to one. Specifically, we use the six primary intervals present in the Tonnetz (see Figure 2) so that the transition probability p(τ′ | τ, w) can be written as
\[
p(\tau' \mid \tau, w) =
\begin{cases}
p_i & \text{if } \tau' = \tau + i \text{ for a primary interval } i, \\
0 & \text{otherwise,}
\end{cases}
\]
where τ, τ′ ∈ T are TPCs on the Tonnetz and i ∈ {+P5, −P5, +M3, −M3, +m3, −m3} ⊂ I are the directed primary intervals of a perfect fifth, major third, and minor third, ascending and descending, respectively. This means that the transition probability p(τ′ | τ, w) can only be non-zero if τ and τ′ are neighboring tones on the Tonnetz. Accordingly, the weights w can be represented as a six-dimensional probability vector $w = (p_{+P5}, p_{-P5}, p_{+M3}, p_{-M3}, p_{+m3}, p_{-m3})$. All other intervals, such as ascending or descending seconds, tritones, or augmented sixths, are expressed as combinations of these primary intervals.
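How these transition probabilities give rise to a pitch-class distribution can be sketched with a simple dynamic program. The code below is an illustrative approximation, not the published implementation: it assumes that TPCs are identified with integer positions on the line of fifths, that the path length is binomial, and that the weights shown are made-up values; the names `tdm_distribution`, `n_trials` and `success_prob` are hypothetical.

```python
from math import comb

import numpy as np

# Line-of-fifths displacement of each primary interval (assumption of this sketch).
STEPS = {"+P5": 1, "-P5": -1, "+M3": 4, "-M3": -4, "+m3": -3, "-m3": 3}


def step_kernel(weights, radius=4):
    """One-step kernel: kernel[radius + d] is the probability of moving by d fifths."""
    kernel = np.zeros(2 * radius + 1)
    for name, prob in weights.items():
        kernel[radius + STEPS[name]] += prob
    return kernel


def tdm_distribution(weights, n_trials, success_prob, half_width=30):
    """Distribution over line-of-fifths positions; the tonal origin sits at index half_width."""
    current = np.zeros(2 * half_width + 1)
    current[half_width] = 1.0  # before any step, all mass is on the tonal origin
    kernel = step_kernel(weights)
    result = np.zeros_like(current)
    for n in range(n_trials + 1):
        # Binomial probability of a path with exactly n steps.
        path_prob = comb(n_trials, n) * success_prob**n * (1 - success_prob) ** (n_trials - n)
        result += path_prob * current
        # Advance by one step: convolve with the interval kernel.
        current = np.convolve(current, kernel, mode="same")
    return result


# Made-up interval weights that sum to one.
weights = {"+P5": 0.3, "-P5": 0.4, "+M3": 0.1, "-M3": 0.1, "+m3": 0.05, "-m3": 0.05}
pcd = tdm_distribution(weights, n_trials=5, success_prob=0.6)
print(round(float(pcd.sum()), 6))
```

The window of `half_width` positions around the origin must be large enough to hold the longest path (here at most 5 steps of 4 fifths each), otherwise probability mass is lost at the boundary.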
Finally, we assume a uniform prior over the path-length variable λ, that is, the rate in case of a Poisson distribution and the success probability and number of trials for a binomial distribution. Note that by choosing uniform priors over all piece-level variables β, the MAP estimate becomes equivalent to an ML estimate.
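The equivalence of the two estimates follows directly from Bayes' rule, as the following short derivation shows. The symbol $x$ for the observed tones of a piece is introduced here for illustration.

```latex
% x: observed tones of a piece, \beta: piece-level variables
\hat{\beta}_{\mathrm{MAP}}
  = \arg\max_{\beta}\, p(\beta \mid x)
  = \arg\max_{\beta}\, \frac{p(x \mid \beta)\, p(\beta)}{p(x)}
  = \arg\max_{\beta}\, p(x \mid \beta)
  = \hat{\beta}_{\mathrm{ML}},
% because p(x) does not depend on \beta and p(\beta) is constant under a uniform prior.
```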