Not All Roads Lead to Rome: Pitch Representation and Model Architecture for Automatic Harmonic Analysis

Automatic harmonic analysis has been an enduring focus of the MIR community, and has enjoyed a particularly vigorous revival of interest in the machine-learning age. We focus here on the specific case of Roman numeral analysis which, by virtue of requiring key/functional information in addition to chords, may be viewed as an acutely challenging use case. We report on three main developments. First, we provide a new meta-corpus bringing together all existing Roman numeral analysis datasets; this offers greater scale and diversity, not only of the music represented, but also of human analytical viewpoints. Second, we examine best practices in the encoding of pitch, time, and harmony for machine learning tasks. The main contribution here is the introduction of full pitch spelling to such a system, an absolute must for the comprehensive study of musical harmony. Third, we devised and tested several neural network architectures and compared their relative accuracy. In the best-performing of these models, convolutional layers gather the local information needed to analyse the chord at a given moment while a recurrent part learns longer-range harmonic progressions. Altogether, our best representation and architecture produce a small but significant improvement on overall accuracy while simultaneously integrating full pitch spelling. This enables the system to retain important information from the musical sources and provide more meaningful predictions for any new input.

Some sense of 'tonal harmony' is common to a very wide range of musics, including most Western Classical music (since the earliest emergence of harmonic writing), as well as most jazz, pop, rock, and much more besides.
Unsurprisingly given their ubiquity, tonal scales, keys, and chords feature prominently from the earliest stages of many music theory pedagogies, 1 and have been the subject of much theorisation. Tymoczko (2011), for instance, takes a suitably expansive view of this broad spectrum of tonality, identifying five features that draw this diverse set of musics together:
1. 'Conjunct motion'
2. 'Acoustic consonance'
3. 'Harmonic consistency'
4. 'Limited macroharmony'
5. 'Centricity'
These features are indeed descriptive of the Western repertoires mentioned: their melodies tend to move in conjunct steps (i.e. to adjacent notes) most of the time (1); the harmonies are centered on the consistent use of highly consonant triads and sevenths (2, 3); and those melodies and harmonies are organised in relation to scales which focus the predominant pitch usage across long spans on a limited collection (4) and center the passage on one primary pitch (5).
This focus on triads and sevenths specifically delimits the wider range: while many world musics can be described as 'tonal' according to the above definition, 'triads and sevenths' presupposes particular types of scalar and harmonic construction. This particular construction likewise poses a specific set of theoretical questions for how best to describe and understand those harmonies. Many solutions have been proposed, reflecting, in part, the extraordinary diversity to be found even within the narrower 'triads and sevenths' repertoires. Like those musics themselves, most such descriptive systems share a great deal of common ground but diverge considerably in their details. The two systems most widely used today are: chord symbol charts (as used prescriptively in lead sheets for jazz performance, for instance), and Roman numeral analysis (primarily used descriptively for the analysis of Classical music).
Like other representations of tonal harmony, Roman numeral (hereafter 'RN') analysis focuses on recording chords, specifying the triad quality (major, minor …), seventh (where applicable), inversion (bass note), and any modifications (such as added and altered notes). Unlike most systems, RNs also specify an analytical view of the local and global keys to which those chords belong and so also their harmonic functions (hence the term 'functional analysis'). Figure 1 provides an example of RN analysis (given in text below the lowest stave). The letters before the colons (here, C in bar 1 and G in bar 6) mark changes in key. The Roman numerals (I, ii, V …) indicate on which degree of the scale the harmony is built. The chords' qualities are given by the Roman characters' case (upper/lower) and the inversion is indicated by the Arabic numerals at the end. Whatever the representation system used, a chordal description of music involves a reductive view of the total pitch information for all but the simplest of cases. That is, in both prescriptive and descriptive contexts there will be 'non-harmonic' pitches that are in the music but not represented in the chords.
Micchi, G., et al. (2020). Not All Roads Lead to Rome: Pitch Representation and Model Architecture for Automatic Harmonic Analysis. Transactions of the International Society for Music Information Retrieval, 3(1), pp. 42-54. DOI: https://doi.org/10.5334/tismir.45
* Univ. Lille, CNRS, Centrale Lille, UMR 9189 - CRIStAL - Centre de Recherche en Informatique Signal et Automatique de Lille, Lille, FR
At least in the descriptive/analytical case, there may be many different, equally credible readings of the same passage. This stems from the ambiguity inherent in mutually informative decisions over:
• whether and where to change chords,
• whether and where to change keys, and
• which notes in the score should be represented in the harmonic reduction at all.
In practice then, while experienced analysts will generally agree over simple contexts, their analyses may vary widely for more complex cases. In short, our intuitive notions of what is 'in' the harmony hide a sophisticated set of judgement calls. Section 2 expands on this matter for readers unfamiliar with this kind of task. For now, we proceed to survey prior work attempting to automate this process.

Previous Computational Approaches
Befitting the fact that there have been formalisations of harmony throughout the history of music theory, there have likewise been attempts at computer-based modelling of this problem for as long as that has been practically possible. Early efforts include Steedman and Longuet-Higgins (1971) and Holtzman (1977)'s programs for deducing the key of a piece from its pitches. Krumhansl and Kessler (1982) subsequently integrated perceptual matters, and further improvements to algorithmic key detection include Temperley (1997, 1999); Madsen and Widmer (2007); Robine et al. (2008); Nápoles López et al. (2019).
Other efforts have focused on individual chords. Identifying what the chords are depends on the interconnected problem of determining when they change (Pardo and Birmingham, 2002), and thus both chordal analysis and generation algorithms have seen improvements by taking context into account (Paiement et al., 2005; Rocher et al., 2009; McFee and Bello, 2017; Ju et al., 2017, 2019). Studies taking on the automation of full functional harmonic analysis are more recent, perhaps because they require the simultaneous assessment of keys and chords. For instance, Illescas et al. (2007) demonstrate full RN analysis of the Bach chorale corpus, while Kröger et al. (2008) implement a system called 'Rameau' which combines four different algorithms for RN prediction (but which does not offer precise comparisons between them). Following Schenker (1935) and Lerdahl and Jackendoff (1983), several scholars have proposed hierarchical models for encoding harmonic functional relationships (De Haas et al., 2009; Harasim et al., 2018), often with visualisation methods among the goals (Sapp, 2005; Rohrmeier, 2011).
The major development of the last few years is the application of machine learning techniques to the task of RN analysis (Chen and Su, 2018, 2019), partly due to the rapidly growing provision of relevant corpora (see below). Machine learning methods would seem to be a good fit for the task of RN analysis as the constituent problems involved (identification of keys, chords, and functions) are deeply related but in complex ways. For example, while we know that there are regularities to what is 'in' the harmony, we can pin down the specifics only so well using rule-based algorithms.

Analysis Datasets
In the last decade, several corpora of human harmonic analyses have been published, spanning classical, jazz and pop/rock repertoires. Among these, the most relevant to the present study on functional harmonic analysis are those datasets expressed in RNs and focussed on Western Classical music:
• 'TAVERN' (Devaney et al., 2015),
• 'ABC' (Neuwirth et al., 2018),
• 'BPS-FH' (Chen and Su, 2018), and
• 'Roman-Text' (Tymoczko et al., 2019).
Table 1 summarises the scale and repertoire focus of these corpora, and Section 3.1 discusses the slight variations in the standards used. RN notation was initially designed for Western Classical music and while it can be (and is) profitably applied to wider repertoires such as pop/rock (see for instance Duinker (2019)), datasets on harmonic analysis of that wider repertoire do not generally include functional labels. Instead, they specify chords directly, in the style of lead sheets:
• 'annotated jazz chord progression corpus' from Granroth-Wilding and Steedman; 2
• de Clercq and Temperley (2011)
The full set of relevant datasets is wider still, with some offering chordal information among other parameters:
• DDMAL's Billboard Project, 7 with chords, structure, instrumentation, and timing annotations of Billboard chart hits;
• the 'iRb' Jazz Corpus; 8
• C4DM's Isophonics datasets. 9

Aim and Contents
While the type and degree of descriptive detail involved in RN analysis may be more or less appropriate depending on the musical circumstance, most other representations of harmonic analysis can be derived from the details held within it. As such, an automatic system that takes in a musical source and returns full RN analysis constitutes a defining benchmark for performance in any aspect of automatic harmonic analysis. This paper sets out our attempts to realise that goal.
Section 2 completes the motivation for this study and approach by setting out some specific examples of the ambiguities involved in harmonic analysis, Section 3 proceeds to the method used, Section 4 turns to the results and some interesting edge-cases, and Section 5 provides an outlook.
All software developed for this project is freely available under an open-source licence at https://gitlab.com/algomus.fr/functional-harmony.

On Functional Harmonic Analysis
Many scholars have offered heuristic preference rules for approaching the task of harmonic analysis. For instance, Tymoczko et al. (2019) suggest the possibility of preferring:
1. harmony changes on metrically strong positions and at regular intervals;
2. to analyse similar material in similar ways;
3. to identify as 'harmonic' notes that do not belong to any common species of non-harmonic tone (e.g. notes that are both leapt-to and leapt-from); and
4. harmonic analyses that are more consistent with standard harmonic theory.
These rules align neatly for simple cases such as Bach's iconic prelude BWV846 (see Figure 1), pointing in this case to harmony changes once per measure. There may be some disagreement about where to mark the changes of key (see discussion in Section 4), but the changes and membership of the chords are mostly straightforward. Indeed, there are arguably no non-harmonic tones until measure 23 (see Figure 2). Here, in order to separate harmonic from non-harmonic, we have to select between two (or more) possible options: F minor (with F, A♭, and C in the chord, excluding B, D) or B diminished 7th (with B, D, F, and A♭ in the chord, and C eliminated). The preference for leaping to consonant notes would guide us towards the latter view, though credible arguments can be (and have been) made on both sides on the basis of the wider progression.
The Bach example begins to show that more complex contexts can run these rules into self-contradiction. It quickly becomes impossible to determine a system of priorities among those rules that will generalise to all musical cases. Instead, analysts may take these rules for 'in principle' guidance, but must make complex judgement calls to arrive at a preferred solution, knowing that it is one among several viable options. This is a strong incentive for exploring automated systems which can similarly handle such ambiguity, without depending on a hierarchy among explicit, deterministic rules.
Measures 34-5 of 'Einsamkeit' from Schubert's Winterreise cycle (D.911, No.12) provide an example of how these preference rules can lead us to at least three different analyses. Figure 3 shows the score along with three analyses below as 'lyrics'. The rest of this section talks through those analyses in relation to the 'rules' outlined above, and Table 2 provides a summary.
Measure 34 marks the completion of an uncontroversial B minor cadence. Thereafter, if we prefer notes at metrically strong positions (rule 1), we start with a tonic triad ('i': B, D, F♯) and thus emphasise the dominant note (F♯) on the downbeat of m.35. Let's call this Analysis A1. However, that view contravenes rule 3 as both G3 and E♯3 are leapt-from/-to and yet excluded from the chord.
Alternatively (Analysis A2), we could take the leaps in the melodic line to indicate chord membership (rule 3). That radically changes the outcome: now F♯ goes from a position of priority to being the only note we would exclude (as a passing note between the E♯ and G natural). This leads us to the very different chordal reading of a German 6th (G, B, D, E♯), and the need to select a starting point and inversion. Among the possibilities here, the figure shows a change to Ger42 on the note D; this parallels the change of chord (within dominant function) at the earlier occurrence of this material in m.22 (rule 2). The pivot to (and pre-dominant of) the B minor cadence at the end of m.32 (not shown) was a clear German 6th and in the 'correct' inversion, thus putting the chord 'in the listener's ear' and providing further precedent for this reading (rule 2 again).
Perhaps the following chords can provide helpful context? A modally mixed vii°7-I of G major/minor follows, with either the E♭ (in vii°7) borrowed from G minor into G major, or else the B♮s (in 'I', but also the vocal line of m.36) borrowed from the major into the minor. Informally, it would appear that minor borrowing into major mid-piece is more common in the target repertoire (rule 4), 10 though vii°7 comes first here, and thus might be thought to take priority. As for mm.34-35, this subsequent move to G major (or is it minor?) may encourage us to take m.35 as G major, yielding a I to initiate and balance a I-vii°7-I progression in G (Analysis A3). That accounts for the Gs but the E♯ is either back to being a leapt-to chromatic lower neighbour note, or else included in a chord with the B2 and D3, but no G (much less a better-fitting G♯), which gives us no kind of tonal triad at all. 11 In cases like this, we will all have views on how to proceed but no one can claim to have the single, definitive, and unequivocally 'correct' answer.

Method
This section outlines the nature and content of the training materials (Section 3.1), the different encoding options for both input and output data (3.2), decisions over whether and how to use transposition for data augmentation (3.3), and finally the details of the network architecture, implementation and training (3.4).

Meta-corpus
We prepared a meta-corpus of harmonic analyses, combining all previously published corpora of RN analyses as discussed above and itemised in Table 1. To bring these corpora together, we developed a set of new open source converter tools which we offer to the community. The RN annotations and these tools are available at https://gitlab.com/algomus.fr/functional-harmony, with links to the corresponding scores. Altogether, the corpus comprises 201 scores and over 70,000 RN annotations.
We sought to convert each representation standard directly, without changing or interpreting those analyses except in case of clear errors. Details are available on request, and we are in the process of integrating the conversion code into music21's sub-library for parsing Roman text to support future efforts bringing corpora together (Cuthbert and Ariza, 2010). For the ABC corpus, we used the version reported by Tymoczko et al. (2019). 12 In all cases, the .rntxt files set out the identity of analysts, proof-readers, and converters involved, and the original datasets are available online for comparison.
[Table 2 caption, truncated] … of Schubert's 'Einsamkeit' (see Figure 3). The analyses are written in .rntxt format (Tymoczko et al., 2019), as explained in Section 3.1. The 'rules' in the second and third columns are set out at the beginning of Section 2.
Different Annotators. Among these datasets, 'TAVERN' is the only one to include more than one alternative reading of the same piece by different annotators, a feature that is useful for communicating to the algorithm that there can be multiple valid readings of the same passage. For example, some annotators may prefer to define fewer chord changes in order to emphasise longer-range structure of the piece (excluding many notes as non-chord tones); others may include more of those notes, leading to a narrower focus on momentary changes. In this study, for the sake of simplicity, we elected to treat each of these analyses independently. It is clearly not quite right to treat multiple analyses of the same music as equivalent to analyses of separate pieces, and doing so will introduce some bias in the model; however, we consider this a small detraction relative to the gain in variance afforded by the alternative readings.

Additionally, while the other datasets consistently offer one analysis per piece, by drawing them together, we have integrated a broad range of analytical perspectives. Again we consider that diversity an asset above and beyond the simple gain in scale, though we are not in a position to make any claims towards a 'balance' of approaches represented, and certainly not to 'representativeness'. For instance, each of the original datasets focuses on a different style, and so there is a non-separable correlation between annotators and musical genres. Future work could explore inter-annotator stylistic variance, in order to get a data-driven sense of the variety of approaches and how best to balance them as the provision of corpora continues to grow.
Data Formats. Whatever the original formats of our data, we have elected here to use three formats which we find collectively offer the best balance between uniformity and suitability to the range of tasks involved. For analysis input, we recommend the human-readable and music21-parseable 'Roman text' (.rntxt) format (Tymoczko et al., 2019); for the presentation of results aligned with scores, we offer .json files that can be interpreted and visualised by Dezrann (Giraud et al., 2018); and for machine learning, we prefer a tabular representation based on that originally proposed by Chen and Su (2018).
This last, tabular format encodes RN analyses according to six properties:
1. Start offset: the beginning of the annotation in question as measured from the start of the score in 'quarter length' (1 = 1 quarter note);
2. End offset: an equivalent for where the annotation ends (usually coincident with the start of the next entry);
3. Key: tonic, specifying full pitch spelling (so that G♯ ≠ A♭) and mode (uppercase for major; lowercase for minor);
4. Quality: for example, major or minor triad; major, minor, or dominant seventh;
5. (Scale) Degree: from 1 (the tonic) to 7 with the potential for accidental modifications (e.g. ♯4) and/or secondary, 'tonicised' degrees (5/5); and
6. Inversion: counting from 0 (root position: bass note = chord root) to a maximum of 3 (thus supporting all inversions of seventh chords, but no ninths).
Table 3 sets out the beginning of the Bach prelude from Figure 1 with the Roman text and tabular representations aligned for comparison.
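The six-property layout can be made concrete with a minimal Python sketch. The class name `RNRow`, its field names, the quality strings, and the helper `is_contiguous` are illustrative assumptions rather than the project's actual data structures. The example rows follow the well-known opening of the Bach prelude (one chord per 4/4 measure: I, ii42, V65, I) and are not copied from Table 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RNRow:
    """One row of the tabular RN format (hypothetical class and field names)."""
    start: float    # start offset in quarter lengths from the start of the score
    end: float      # end offset in quarter lengths
    key: str        # tonic with full spelling; uppercase = major, lowercase = minor
    quality: str    # e.g. "major triad", "dominant seventh" (labels are assumptions)
    degree: str     # "1".."7", optionally with an accidental ("#4") or tonicisation ("5/5")
    inversion: int  # 0 (root position) to 3

    def __post_init__(self):
        if self.end <= self.start:
            raise ValueError("end offset must follow start offset")
        if self.inversion not in range(4):
            raise ValueError("inversion must be between 0 and 3")

    @property
    def is_minor_key(self) -> bool:
        # Mode is carried by the case of the tonic letter.
        return self.key[0].islower()


def is_contiguous(rows):
    """True when each annotation ends exactly where the next one begins."""
    return all(a.end == b.start for a, b in zip(rows, rows[1:]))


# Opening of Bach's prelude BWV 846 in 4/4, one chord per measure: I, ii42, V65, I.
rows = [
    RNRow(0.0, 4.0, "C", "major triad", "1", 0),
    RNRow(4.0, 8.0, "C", "minor seventh", "2", 3),
    RNRow(8.0, 12.0, "C", "dominant seventh", "5", 1),
    RNRow(12.0, 16.0, "C", "major triad", "1", 0),
]
```

Keeping the degree as a string is one possible design choice: it lets accidentals and tonicisations share a single column, at the cost of parsing when the two degree labels are needed separately.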

Encoding Input and Outputs
Identifying best practice in the encoding of musical information for machine learning is an open problem (Huang et al., 2018;Briot et al., 2020). One consideration we know to be highly relevant in determining the best approach is the size of the dataset. While our dataset is larger than previous efforts, it is still small by the standards and requirements of machine learning. This redoubles the significance of the representation format: we want to include all relevant information, but the more we compress that information, the fewer parameters the system has to learn, and the more one can achieve with a smaller dataset. This section addresses the three primary aspects of data encoding for our purposes: time, pitch, and RN representation.
Time. The literature proposes two main approaches to time encoding. The first (Oore et al., 2018) represents the score as a series of three possible event types: note on, note off, and time shift (following MIDI conventions). A time shift event defines the distance between two successive note events. This representation overcomes certain problems particularly common in music generation tasks, 13 but it can conceal the music's metrical structure, which is important in harmonic analysis. Much better represented in the literature is the alternative 'frame-based' encoding method, where each input vector denotes an individual time frame. Most studies on symbolic music opt for some factor-of-two multiple for the smallest slice (1/8th, 1/16th, or 1/32nd notes) and accept the errors that this will entail for shorter values and for all triplets (which are quantised to binary positions).
We follow this latter practice for equal-duration, binary division frames. In our case, we use a 32nd note for input encoding (notes) and 8th note for the output (chords), as the harmonic rhythm is almost always (much) slower than the surface rhythm. Finally, we divide all scores into segments of equal quarter-note duration and pad with zeroes to the right when needed.
Pitch. The options for pitch encoding may be set out in two dimensions. The first accounts for pitch spelling. Here we must choose between using pitch class representations (12 per octave, and no difference between the enharmonic equivalent pairs like G♯ and A♭), or maintaining the full pitch spelling (with 21 possibilities per octave for single sharps/flats and 35 for double). 14 The other dimension concerns registral information. Keeping octave information leads to richer data, but excluding it would be more compact. We propose a third, 'compromise' option reflecting the special role of the bass in tonal harmony in defining both chordal inversion and other important matters for harmonic progression. In this case, music is encoded with two vectors per frame: one with the lowest note and another with the total pitch content. The fact that the lowest note may not be indicative of the bass is one of the many tasks that the system would need to learn. Table 4 sets out these options with their relative size for the case of a 7-octave space and chromatic spellings of up to double sharps/flats.
Regardless of the pitch space chosen, we define a Boolean matrix with time frames on one axis and pitches on the other: The value is 1 if the pitch is present in that frame, and 0 otherwise. In this encoding, multiple pitches may be activated in the same time frame where they sound simultaneously in the source (as in chords, for example). This data representation reduces to the familiar piano roll notation when using CPf for the pitch space.
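As an illustration of this Boolean frame-by-pitch matrix, the sketch below encodes a few notes with 32nd-note frames, using the 'compromise' layout of a lowest-note vector followed by a pitch-content vector, both reduced to 12 pitch classes. The function name, the `(midi_pitch, start, end)` note tuples, and the column order are assumptions made for this example; the project's own encoder may differ.

```python
FRAMES_PER_QUARTER = 8  # 32nd-note frames, the input resolution given in the text
N_PITCH_CLASSES = 12


def encode_bass_and_pitch_classes(notes, total_quarters):
    """Return a list of frames, each a 24-element 0/1 vector.

    Columns 0-11 mark the pitch class of the lowest sounding note and
    columns 12-23 mark every sounding pitch class (hypothetical layout).
    `notes` holds (midi_pitch, start, end) tuples with times in quarter lengths.
    Frames beyond the last note stay at zero, which acts as right padding.
    """
    n_frames = int(total_quarters * FRAMES_PER_QUARTER)
    roll = [[0] * (2 * N_PITCH_CLASSES) for _ in range(n_frames)]
    lowest = [None] * n_frames

    for midi, start, end in notes:
        first = round(start * FRAMES_PER_QUARTER)  # quantise to the frame grid
        last = round(end * FRAMES_PER_QUARTER)     # exclusive
        for frame in range(first, min(last, n_frames)):
            roll[frame][N_PITCH_CLASSES + midi % 12] = 1
            if lowest[frame] is None or midi < lowest[frame]:
                lowest[frame] = midi

    for frame, midi in enumerate(lowest):
        if midi is not None:
            roll[frame][midi % 12] = 1
    return roll


# C3 held for one quarter note, with E4 and G4 entering half-way through.
example_notes = [(48, 0.0, 1.0), (64, 0.5, 1.0), (67, 0.5, 1.0)]
example_roll = encode_bass_and_pitch_classes(example_notes, total_quarters=2)
```

Durations that do not fall on the 32nd-note grid (triplets, for instance) are simply rounded to the nearest frame here, which mirrors the quantisation error accepted in the text.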
One potential shortcoming of such a frame-based encoding is that it fails to distinguish between repeated and held notes. Hadjeres et al. (2017) and Liang et al. (2017) include special symbols to disambiguate this on voice-separated music. When the number of voices is not fixed, one symbol per note is required, doubling the size of the input vector. We decided not to encode that information partly due to the loss of compactness, but also because we do not expect distinguishing tied from repeated notes to be especially important for harmonic analysis.
RN Output Labels. Continuing to follow Chen and Su (2018), we output the harmonic analysis with six labels: Key, Degree 1, Degree 2, Quality, Inversion, and Root. The two labels for scale degrees handle cases of tonicisations in the format 'Degree 2/Degree 1'. The labels for keys and chord roots depend on the choice of the input representations. For all CP cases there are 12 possible chord roots and 24 keys (12 major and 12 minor). When the input is in a PS encoding, the number of possibilities increases: there are 35 roots and thus 70 keys for the double sharp/flat condition. 15 There is some redundancy built into this system as it is possible to derive the root unambiguously from other features. However, learning redundant variables can be helpful to the algorithm's success. The division of each RN label into six independently-computed sub-labels reduces the complexity of the task, since the total number of possible outputs for our best-performing representation is $\sum_i c_i = 123 \ll \prod_i c_i \approx 22 \cdot 10^6$, where $c_i$ is the number of output classes for each separate target label. It also improves the interpretability of the results, allowing one to focus on each aspect separately.
This comes at the cost of a potential for self-contradictory outputs in which the six sub-labels have different ideas about the chord. In practice, we find that this is only rarely a problem, arising in the particular case of the 'no chord' label used by the ABC dataset (only) for passages with rests and/or single line melodies. Given the inconsistency in the source data, we do not include a provision for the 'no chord' case. Instead, we fill any such gap with a continuation of the foregoing chord, except in the case of beginnings, for which we start the first chord early.
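The size comparison between separate and joint outputs can be checked with a few lines of Python. Only the totals (123 and roughly 22 million), the 35 roots, and the 4 inversions are stated in the text; the remaining per-label class counts below are assumptions chosen to be consistent with those totals, and the actual values may be distributed differently.

```python
from math import prod

# Per-label class counts. "root" and "inversion" follow the text; the other
# values are ASSUMED for illustration and merely reproduce the stated totals.
classes = {
    "key": 30,        # assumption: 15 major + 15 minor keys
    "degree_1": 21,   # assumption
    "degree_2": 21,   # assumption
    "quality": 12,    # assumption
    "inversion": 4,   # inversions 0 to 3
    "root": 35,       # spelled roots up to double sharps/flats
}

separate_outputs = sum(classes.values())  # six independent softmax layers
joint_outputs = prod(classes.values())    # a single softmax over all combinations
```

Under these assumed counts, the six separate classifiers need 123 output units in total, whereas one joint classifier over every combination would need more than 22 million.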

Data Augmentation by Transposition
In practice, keys are not used equally. It is common in both analysis and generation tasks to augment the dataset by transposing it to multiple keys (Huang et al., 2018;Chen and Su, 2019). While a single piece in two transpositions should not be considered equivalent to two distinct pieces (for reasons somewhat analogous to the status of multiple analyses of the same piece discussed above), transposition does stand to augment considerably the overall size of the dataset.
While working within the 'CP' encoding space (as is the case for all work based on MIDI), there are only 12 distinct transpositions: one for each distinct pitch class.
When including pitch spelling, transposition moves not through a circle, but a spiral, potentially infinitely. Clearly, some constraint is required to limit this pitch space. We define two such constraints, both based on the 'spiral of fifths' where pitch objects (usually keys) are set out according to their relative flat-/sharp-ness.
Our first constraint limits the pitches to double flats/sharps, from F♭♭ to B♯♯. To enforce this constraint, we need to retrieve the 'chromatic ambitus' of each piece, delimited by the 'flattest' and 'sharpest' pitches used. For instance, Schubert's 'Einsamkeit' (Figure 3) ranges from E♭ to E♯, meaning that it can be transposed by 12 further steps in the flat direction and 8 steps sharpwards while still remaining within the set limit of double sharps and flats.
Our second constraint limits the keys to a narrower range from C♭ major to C♯ major and their relative minors (A♭ minor to A♯ minor) such that the diatonic pitches are limited to single flats/sharps. We do this to reduce the computational load without losing actual information, as real pieces very rarely go outside these key boundaries. For this constraint, we need to look at the chord labels. In the Schubert example, by almost any reading, the sharp-most key used is B minor (5 steps away from the limit of A♯ on the spiral of fifths), and the flat-most is C minor (4 steps away from A♭). This procedure favours pieces with limited modulations. Figure 4 sets out the range in our corpus. The majority of pieces can be transposed 10-13 times, within an overall range from as few as 3 transpositions (for highly chromatic works) to as many as 15 (for pieces that never leave their home key or its relative). The more harmonically adventurous pieces are thus also the least numerously represented. As a possible, partial solution, one could transpose segments of the score separately. As the chromatic and key range of each segment is necessarily less than (or occasionally equal to) that of the overall work, these sections would be transposed more times.
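The two constraints can be combined into a short calculation of how far a piece may be shifted along the spiral of fifths. The sketch below is one possible reading of the procedure, not the project's implementation: the function names, the ASCII spelling convention ('#' for sharp, 'b' for flat), and the `(tonic, is_minor)` key tuples are all assumptions.

```python
FIFTHS_ORDER = "FCGDAEB"  # natural notes in ascending fifths


def fifths_index(name):
    """Position on the line of fifths: F = 0 ... B = 6, +7 per sharp, -7 per flat."""
    letter, accidentals = name[0].upper(), name[1:]
    return FIFTHS_ORDER.index(letter) + 7 * (accidentals.count("#") - accidentals.count("b"))


def key_signature(tonic, is_minor):
    """Number of sharps (positive) or flats (negative) in the key signature."""
    return fifths_index(tonic) - (4 if is_minor else 1)


# First constraint: spelled pitches from Fbb to B## (35 positions).
PITCH_MIN, PITCH_MAX = fifths_index("Fbb"), fifths_index("B##")
# Second constraint: keys from Cb major to C# major and their relative minors.
SIG_MIN, SIG_MAX = key_signature("Cb", False), key_signature("C#", False)


def allowed_shifts(flattest_pitch, sharpest_pitch, keys):
    """Range of shifts in fifths (0 = original) satisfying both constraints."""
    signatures = [key_signature(tonic, is_minor) for tonic, is_minor in keys]
    flat_steps = min(fifths_index(flattest_pitch) - PITCH_MIN, min(signatures) - SIG_MIN)
    sharp_steps = min(PITCH_MAX - fifths_index(sharpest_pitch), SIG_MAX - max(signatures))
    return range(-flat_steps, sharp_steps + 1)


# Schubert's 'Einsamkeit': pitches from Eb to E#, keys from C minor to B minor.
einsamkeit = allowed_shifts("Eb", "E#", [("B", True), ("C", True)])
```

With the figures quoted for 'Einsamkeit', the pitch constraint alone would allow 12 flatward and 8 sharpward steps, but the key constraint is the tighter one (4 and 5 steps), giving 10 versions including the original. Counting the original among the transpositions is an assumption here, made because it matches the stated maximum of 15 for pieces that never leave their home key or its relative.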

Network Architecture
We propose a neural network architecture that divides the process of RN analysis into two separate but interconnected parts (see Figure 5). The first part analyses the local context with a window size of 2 quarter notes. This corresponds to the human analyst distinguishing between harmonic and non-harmonic tones, producing a chordal reduction and deriving the Quality, Inversion, and Root labels. The second part, in turn, focuses on the more global matters of chord progressions and key selection. The RN analysis emerges from the structure and pattern of those progressions, expressed in the Key, Degree 1, and Degree 2 labels.
The local part (Conv) is a 1-D implementation of the convolutional architecture 'DenseNet' (Huang et al., 2016). We convolve along the time domain and encode pitches as different feature maps on independent channels (analogous to different colour channels in image analysis). One particularly distinctive and relevant feature of DenseNet is the preservation of the same feature maps for multiple convolutional layers in order to analyse the same information at successive levels of abstraction. This allows the network to keep some important information in memory instead of having to learn how to store it anew every time. The DenseNet also contains pooling layers that we use to pass from the time resolution of the input notes to those of the output chords.
For the second part of the network, we experimented with two alternative architectures: the first is a dilated convolution (Dil), of the type introduced by Yu and Koltun (2015) and adapted to 1-D data by Oord et al. (2016). We use a non-causal dilated convolution, meaning that we allow the system to use both past and future events when determining each chord. The convolution is made of 4 layers with 64 kernels each of size 3 and a dilation of $3^l$, where $l$ is the layer index. This means that each prediction can use information from a total context of $3^4 = 81$ eighth notes: the present one as well as 40 from the past and 40 from the future. In most cases, this should be ample context for analysing chord progressions. This architecture is fast and scales well with the length of the input segment, both in terms of speed and reliability.
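The stated context of 81 eighth notes follows from the standard receptive-field calculation for stacked dilated convolutions, sketched below. The function name is hypothetical, and the layer index is assumed to start at 0, an inference from the stated total (starting at 1 would give a larger context).

```python
def receptive_field(n_layers=4, kernel_size=3, dilation_base=3):
    """Number of output frames visible to one prediction after stacked dilated convolutions.

    Each layer with dilation d and kernel size k widens the context by (k - 1) * d.
    The layer index is ASSUMED to start at 0, so the dilations are 1, 3, 9, 27.
    """
    field = 1  # the current frame itself
    for layer in range(n_layers):
        dilation = dilation_base ** layer
        field += (kernel_size - 1) * dilation
    return field


context = receptive_field()
frames_each_side = (context - 1) // 2  # non-causal: split evenly between past and future
```

With a kernel of size 3 and dilations that grow by a factor of 3, the kernels of successive layers tile the context without gaps or overlaps, which is why the total equals exactly 3 to the power of the number of layers.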
The second option for the global part of the architecture is a bidirectional recurrent neural network using gated recurrent units (GRU) (Cho et al., 2014). 16 Being bidirectional, this method also uses information from both past and future frames, though the process differs from that of the dilated architecture. The hidden state is made of 64 neurons per direction and uses a dropout rate of 0.3. This architecture is more expressive than the dilated convolution since it allows for correlations of theoretically infinite length. Further, the gated internal structure is more sophisticated than that of the dilated convolution, allowing for the discovery of more complex correlations. That said, it is also harder to train and scales poorly with the length of the given input segments. Therefore one needs to strike a balance in terms of segment length: segments must be long enough to take advantage of the recurrent nature of the network, but short enough to make training feasible. We elected to divide the scores into non-overlapping segments of 80 quarter notes' duration.
Figure 4: The distribution of work transpositions that remain within the set limits of F♭♭-B for pitches and C♭-C for keys.
Figure 5: Architecture of the neural network model in the 'local' training mode. When 'global', Quality/Inversion/Root outputs are computed after the fully connected layer instead. The numbers in the boxes refer to the number of categories for each output label in the PSb case (see Table 4).
With either architecture, the second part ends with a fully connected layer of 64 neurons. Each label is predicted by a fully connected layer with softmax activation, whose size is determined by the number of classes for the label at hand. The network is trained end-to-end to ensure strong connection between the local and global tasks. The loss function used is the standard categorical cross-entropy loss. This is computed on each of the six target labels separately before the results are added, with an equal weighting.
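The equally weighted sum of per-label losses can be illustrated for a single frame as follows. This is a minimal sketch in plain Python rather than the TensorFlow code of the actual model; the label names, the number of classes, and the uniform predictions are placeholders chosen for illustration only.

```python
import math

# Hypothetical names for the six target labels.
LABELS = ("key", "degree1", "degree2", "quality", "inversion", "root")


def cross_entropy(probabilities, target_index, eps=1e-12):
    """Categorical cross-entropy of one softmax output against a one-hot target."""
    return -math.log(max(probabilities[target_index], eps))


def total_loss(predictions, targets):
    """Add the six per-label losses with equal weighting."""
    return sum(cross_entropy(predictions[label], targets[label]) for label in LABELS)


# Toy example: every output head predicts a uniform distribution over 4 classes.
n_classes = 4
uniform = [1.0 / n_classes] * n_classes
predictions = {label: uniform for label in LABELS}
targets = {label: 0 for label in LABELS}
loss = total_loss(predictions, targets)   # equals 6 * log(4) for uniform predictions
print(round(loss, 4))
```

In the real model each label has its own number of classes, so the individual terms are not on an identical scale even though they are weighted equally.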
As a baseline for comparison of these two approaches, we also trained a standard GRU model without local context analysis. We refer to this as PoolGRU as it is preceded by pooling layers to reduce the resolution on the time axis.

Network Training
We randomly allocated 90% of the available scores to the training set, reserving the remaining 10% for validation. Importantly, we implemented this proportion not only for the corpus overall, but for each of the corpora individually. For the special case of TAVERN, those works assigned to the training set included the score and both of the corresponding analyses; pieces in the validation set, by contrast, included only one of the analyses (randomly selected). In order to provide direct comparison with Chen and Su (2018, 2019), we also calculated results using only their dataset, divided in the same way.
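A per-corpus split of this kind can be sketched as follows. The function, the corpus names, and the fixed random seed are hypothetical and serve only to illustrate the principle; the special handling of TAVERN's double analyses is not covered.

```python
import random


def split_by_corpus(scores_by_corpus, train_fraction=0.9, seed=0):
    """Split every corpus separately so that each contributes about 90% to training."""
    rng = random.Random(seed)
    train, validation = [], []
    for corpus in sorted(scores_by_corpus):
        scores = sorted(scores_by_corpus[corpus])
        rng.shuffle(scores)
        n_train = round(train_fraction * len(scores))
        train += [(corpus, score) for score in scores[:n_train]]
        validation += [(corpus, score) for score in scores[n_train:]]
    return train, validation


# Toy corpora with placeholder score identifiers.
toy = {
    "corpus_a": ["a%d" % i for i in range(10)],
    "corpus_b": ["b%d" % i for i in range(20)],
}
train, validation = split_by_corpus(toy)
print(len(train), len(validation))
```

Splitting within each corpus, rather than over the pooled scores, prevents a small corpus from ending up entirely in one of the two sets by chance.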
We trained in two ways. In the first (global) approach, all six labels are predicted at the end of the second part (Dil or GRU). In the second (local) method, the Quality, Inversion, and Root labels are determined at the end of the first, local part of the network and used to determine the key and degree. As discussed, we did not enforce consistency between labels. Given the lack of local context, PoolGRU can only be trained in global mode.
The network was encoded in Python v3.7 using TensorFlow v1.14. The code is available at https://gitlab.com/algomus.fr/functional-harmony and was initially forked from Chen and Su (2018), but all dataset conversions, encodings, and models are original work. Our best model has about 94,000 trainable weights in total: 33,000 for the local part, 43,000 for the global, and the remaining 18,000 for the fully connected layers. Depending on the model, the total training time ranges from 20 minutes to 3 hours, when run on a CPU-only high-performance-computing server.

Overall Metrics
The results for our best model (ConvGRU, PSb, global learning) are summarized and compared with Chen and Su (2018, 2019) in Table 5. The first row sets out the results obtained by our best model using the new metacorpus, while the second row reports on results limited to Chen and Su's own dataset for direct comparison. At a glance, one can see that our proposal achieves a small but significant improvement over the previous state-of-the-art while also taking full pitch spelling into account.
Our emphasis here is on comparing the different encodings and the architectures, and on attempting to identify edge cases. To that effect, we have trained on all possible combinations of the six pitch encodings, the two architectures (and the baseline), and the two training types (except for PoolGRU, which is only applicable for global training). Table 6 presents the results averaged over all of the models.
As the table shows, ConvGRU (mixing local analysis and a GRU unit) is the best performing architecture, surpassing the two alternatives, with a particularly significant improvement over PoolGRU. Indeed, a t-test on the significance of the difference for the full task (the column 'RN' in Table 6) yielded a p-value $< 10^{-2}$ against the null hypothesis of ConvGRU and PoolGRU giving the same result.
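The text does not specify which variant of the t-test was used. Purely as an illustration, the following sketch computes Welch's t statistic and its approximate degrees of freedom for two groups of accuracy values; the choice of Welch's test is an assumption, and the input numbers are placeholders rather than results reported here.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for two samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard errors of the two sample means (sample variance / n).
    se2_a = statistics.variance(sample_a) / len(sample_a)
    se2_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Placeholder values only (NOT accuracies from the experiments).
toy_a = [2.0, 4.0, 6.0]
toy_b = [1.0, 2.0, 3.0]
t_stat, dof = welch_t(toy_a, toy_b)
print(round(t_stat, 3), round(dof, 3))
```

The p-value is then obtained from the t distribution with the computed degrees of freedom, for instance with a statistics package.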
As for the pitch encoding, including the bass information (CPb/PSb) results in markedly higher performance not only in identifying the correct inversions, but for all of the tasks (again, p-value $< 10^{-2}$). On the other axis of pitch representation, using full pitch spelling generally leads to slightly higher results overall, but the differences are not statistically significant. That said, we must remember that analyses without pitch spelling cannot distinguish between enharmonically equivalent keys like G♯ and A♭. As such, the inclusion of spelling means introducing more keys and chord roots and thus amounts to a more difficult task where proportionately fewer answers will be correct. As pitch spelling yields performances that are not worse while performing a harder, more musically relevant task, we conclude that the spelling representation is preferable where the data is available.
Table 5: Comparison of the percent accuracy between models. The two rows above the internal division report on our best model: ConvGRU with pitch spelling and bass (PSb) and with global training. The first row reports on training with all available data; the second reduces the available data to the smaller corpus used by Chen and Su (2018). Rows below the internal dividing line provide comparison data for the performance of Chen and Su (2018, 2019), as well as a baseline key detection using pitch profiles by Temperley (1999). 'Degree' registers as correct only when the predictions match the corpus entry for both Degrees 1 and 2; 'RN' is correct only when all four of the previous columns match in that way.
Temperley (1999) 67.0
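The scoring rules given in the caption of Table 5 can be made concrete with a small sketch. It assumes that the four columns preceding 'RN' are Key, Degree, Quality, and Inversion, and that each frame is stored as a dictionary; both the column set and the label encoding are assumptions made for illustration.

```python
def accuracy_table(reference, predicted):
    """Frame-wise percent accuracy per column, following the rules of the Table 5 caption."""
    counts = {"key": 0, "degree": 0, "quality": 0, "inversion": 0, "rn": 0}
    for ref, pred in zip(reference, predicted):
        hits = {
            "key": ref["key"] == pred["key"],
            # 'Degree' counts only when both Degree 1 and Degree 2 match.
            "degree": ref["degree1"] == pred["degree1"] and ref["degree2"] == pred["degree2"],
            "quality": ref["quality"] == pred["quality"],
            "inversion": ref["inversion"] == pred["inversion"],
        }
        # 'RN' counts only when all four previous columns match.
        hits["rn"] = all(hits.values())
        for name, hit in hits.items():
            counts[name] += hit
    return {name: 100.0 * count / len(reference) for name, count in counts.items()}


# Two toy frames with a hypothetical label encoding; the second prediction misses Degree 2.
reference = [
    {"key": "C", "degree1": "1", "degree2": "5", "quality": "M", "inversion": 0},
    {"key": "C", "degree1": "1", "degree2": "1", "quality": "M", "inversion": 0},
]
predicted = [
    {"key": "C", "degree1": "1", "degree2": "5", "quality": "M", "inversion": 0},
    {"key": "C", "degree1": "1", "degree2": "5", "quality": "M", "inversion": 0},
]
scores = accuracy_table(reference, predicted)
print(scores)
```

Because 'RN' requires every component to be right at once, it can never exceed the lowest of the individual column accuracies.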
Comparing local and global training yields a much more ambiguous result that invites further study. The difference in the total result is not statistically significant. However, when one looks at specific (local) labels such as the quality, one finds that the differences in the intermediate steps taken by the two architectures are significant (with a p-value against the null hypothesis smaller than $10^{-6}$).

A Closer Look at the Music
As this discussion of overall accuracy metrics would seem to indicate, there is more to the task of evaluating the results of an RN analysis. As such, we continue here to take a closer look at the 'errors' made by the models. These 'errors' (or, more properly, divergences between the input corpus and prediction) appear to centre on three main types:
1. Segmentation errors: differences in the timing of chord changes (see Bach prelude, Table 7). This appears to be the most common discrepancy. More specifically, we notice that the predictions tend to change more frequently than the human analyses, particularly in more complex passages. This is presumably on the basis of an attempt to divide the music into small enough segments to allow a cleaner reading of the chord in those small spans. Strategies such as the segmenter layer proposed by Chen and Su (2019) may help.
2. Mislabeling of rare chords: the system is highly reluctant to identify secondary/tonicised chords or chromatic chords like the augmented sixths, presumably because they are relatively rare in the corpus.
3. Alternative readings: moments where the system opts for a reading that is different from that of the validation corpus, but which is nonetheless a perfectly acceptable alternative. Corpora with multiple readings of the same music would be especially helpful here because they offer the system not a single 'correct' answer, but a list of viable options.
Once again, the extract from Schubert's 'Einsamkeit' (discussed in Section 2 and shown in Figure 3) offers a neat example of all three issues. Our reference analysis corresponds broadly to analysis A2. Regarding issue 1, the prediction for measures 34 and 35 is made of four different chord labels, while in the reference dataset there are only two. This is closely connected with issue 2, as the 'mis'-labeled chord is a German sixth, unidentified by our system. Lastly, the 'different but acceptable' reading is pertinent in the case of measures 36 and following, which the dataset analyses in G minor and the system views in G major, and which are in fact an ambiguous mixture of the two (as discussed in Section 2).
Finally, we found some cases we consider unacceptable readings, where the most compelling musical reading diverges from the statistically normative case. For example, in Beethoven's sixth sonata (op.10 no.2, Figure 6), the exposition includes a theme in C major which from measure 41 is repeated in the parallel key of C minor. Perhaps because this lasts for only four measures, the system is reluctant to identify a full modulation, preferring instead to remain in C major.

Improvements
A simple right/wrong accuracy metric is not the best way to measure the performance of an RN analysis algorithm, as several different readings are often equally viable. Even taking this into account, the 43% total accuracy that we report is still far from ideal. In this concluding section, we propose some ideas for improving these results. As mentioned above, this field would benefit greatly from larger datasets, covering both a wider repertoire and multiple, alternative readings of the same works.
It would also be useful to explore wider encoding options, and not just for pitch and time: for the repertoires discussed here, metrical position, dynamics, texture, and other score indications are also strongly attested to have a bearing on harmonic analysis. Including those parameters may improve performance, though informal testing of metrical strength did not yield significant gains, and the quality of machine-readable score encoding often prohibits a serious analysis of parameters like dynamics.
In the time domain, comparisons could include assessing the relative performance of the 'frame-based' approach with the alternative 'variable length' convention (Oore et al., 2018). This latter allows representation of arbitrarily short and long time spans and would save on training time (by virtue of it reducing the total number of entries). It may also better reflect the human experience of music, which does not proceed in granular units, but centres on the information density of events and changes.
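The saving in the number of entries can be illustrated with a simple run-length conversion of frame-wise labels. This is only a sketch of the general idea, not the encoding of Oore et al. (2018); the function names and the toy chord labels are illustrative.

```python
from itertools import groupby


def frames_to_events(frame_labels):
    """Collapse consecutive identical frame labels into (label, duration_in_frames) pairs."""
    return [(label, len(list(run))) for label, run in groupby(frame_labels)]


def events_to_frames(events):
    """Expand (label, duration) pairs back into one label per frame."""
    return [label for label, duration in events for _ in range(duration)]


# Toy frame-wise analysis: eight frames, three harmonic events.
frames = ["I"] * 4 + ["V"] * 2 + ["I"] * 2
events = frames_to_events(frames)
print(events)
```

The conversion is lossless for the labels themselves, while the number of entries falls from one per frame to one per chord change.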
In the pitch domain, it would be interesting to define a space for the relative proximity of pitches and to add a second convolutional dimension on that space. This may lead to an improvement on the current model of using independent channels. Most simply, this could involve implementing the 'line' or 'spiral' of fifths (mentioned above in connection with transposition), and proposals for more complex spaces to explore abound. 17 Relatedly, the evaluation of output could be improved, perhaps through the definition of relative distance in functional terms. This would entail a distance metric between chords to write a more 'musically relevant' loss function which considers chords of the same function (such as ii7 and IV) to be closer to one another than to those of a different function (V7). This could also prove helpful for cases with multiple annotators, providing a metric for the relative divergence between those readings.
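As a minimal sketch of the simplest such space, spelled pitch names can be mapped to positions on the line of fifths, with C at the origin. The ASCII spelling convention ('#' for sharp, 'b' for flat) and the function names are assumptions made for illustration, not part of the published code.

```python
# Position of the natural notes on the line of fifths, with C as the origin.
LETTER_POSITION = {"F": -1, "C": 0, "G": 1, "D": 2, "A": 3, "E": 4, "B": 5}


def fifths_position(name):
    """Line-of-fifths position of a spelled pitch name such as 'Ab', 'G#' or 'Fbb'."""
    letter, accidentals = name[0], name[1:]
    # Each sharp moves seven fifths up the line, each flat seven fifths down.
    return LETTER_POSITION[letter] + 7 * (accidentals.count("#") - accidentals.count("b"))


def pitch_class(name):
    """Pitch class (0-11) implied by a spelled name; spelling information is discarded."""
    return (7 * fifths_position(name)) % 12


# Enharmonic pair: same pitch class, but twelve steps apart on the line of fifths.
print(fifths_position("G#"), fifths_position("Ab"))
print(pitch_class("G#"), pitch_class("Ab"))
```

On this axis, closely related spellings lie next to one another, whereas enharmonic equivalents such as G♯ and A♭ are far apart even though they share a pitch class, which is the kind of proximity a convolution over the pitch dimension could exploit.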
Additionally, while interpreting the results of a machine learning method is always difficult, doing so would help to advance our understanding both of the processes in operation here and, by extension, of harmony itself. One possible way of accomplishing this is to follow the activation of the neurons in a set of simple example cases.
Finally, one could also explore a combination of learned and/or deterministic post-processing to enforce the kind of consistency between labels discussed above. It may be that approaches combining machine learning, deterministic algorithms, and a human-in-the-loop achieve results surpassing those accomplished by each of these methods separately.

Applications
We view the whole endeavour of (semi-)automated harmonic analysis as a means to the end of understanding harmony better. As such, one goal is to produce harmonic analyses at a sufficiently high quality level that they constitute a reliably usable dataset (and object of study) in themselves. This would enable us to scale up questions of how harmonies 'tend to' be used, enabling the field of corpus analysis to realise its potential.
We consider it important that the models we proposed can be adapted to other kinds of harmonic analyses. As discussed above, harmonies in lead sheets have a different ontological status from RNs, and the repertoires represented are stylistically divergent, but the technical problem is comparable and often contained within the framework of the RN analysis.
Table 3) and our system's output (right). Discrepancies between the input and output analyses are highlighted in italics.