The Annotated Mozart Sonatas: Score, Harmony, and Cadence

This article describes a new expert-labelled dataset featuring harmonic, phrase, and cadence analyses of all piano sonatas by W.A. Mozart. The dataset draws on the DCML standard for harmonic annotation and is being published adopting the FAIR principles of Open Science. The annotations have been verified using a data triangulation procedure which is presented as an alternative approach to handling annotator subjectivity. This procedure is suited for ensuring consistency, within the dataset and beyond, despite the high level of analytical detail afforded by the employed harmonic annotation syntax. The harmony labels also encode contextual information and are therefore suited for investigating music theoretical questions related to tonal harmony and the harmonic makeup of cadences in the classical style. Apart from providing basic statistical analyses characterizing the dataset, its music theoretical potential is illustrated by two preliminary experiments, one on the terminal harmonies of cadences and the other on the relation between performance durations and harmonic density. Furthermore, particular features can be selected to produce more coarse-grained training data, for example for chord detection algorithms that require less analytical detail. Facilitating the dataset’s reusability, it comes with a Python script that allows researchers to easily access various representations of the data tailored to their particular needs.


Introduction
Polyphonic music is typically characterized by its harmonic makeup. The study of (tonal) harmony thus occupies a prominent position in musicological research. Owing to the growing availability of machine-readable datasets of harmonic analyses (see Section 2), harmony can now be examined across different styles and periods using advanced computational and empirical methods (e.g., Quinn and Mavromatis, 2011; Broze and Shanahan, 2013; Temperley and de Clercq, 2013; Jacoby et al., 2015; Chen and Su, 2018; White and Quinn, 2018; Moss et al., 2019). The main contribution of this paper is the introduction and description of a new corpus of annotated scores under the FAIR principles of Open Science (Wilkinson et al., 2016). The corpus consists of the digital scores of all 18 piano sonatas (1774-1789) by Wolfgang Amadé Mozart which have been annotated by music theory experts on three levels: harmony, cadences, and phrases. The chord and phrase labels follow the DCML harmonic annotation standard, 1 whereas the cadence labels are based on the typology and definitions by Caplin (2004) and Rohrmeier and Neuwirth (2015). The data is provided not only as annotated scores but also as feature matrices representing notes, bars, and annotation labels. These representations, together with a script for easy and versatile data access, enable detailed note-level analyses of a prominent sample of tonal harmony from the so-called classical era. The data is being published under a Creative Commons License and is available at github.com/DCMLab/mozart_piano_sonatas.

Harmonic Datasets
Among the growing number of datasets featuring analyses of harmony, one of the most influential is the Kostka-Payne Corpus 2 compiled by David Temperley (2009). This dataset has been used, among other things, to support a particular theory of harmonic syntax (Temperley, 2011), as a ground truth for automated harmonic analysis (e.g., Pardo and Birmingham, 2002), and for estimating the abstract harmonic categories underlying surface chords in a Hidden Markov Model (White and Quinn, 2018). There are, however, two main drawbacks inherent to this dataset: first, although it covers Western tonal music from ca. 1750 to 1900, it is very small, involving only 919 labels. Second, this dataset is derived from musical excerpts taken from a well-known music theory textbook and hence may corroborate a particular theoretical standpoint (that one may or may not agree with).
Further (and more recent) datasets enabling researchers to study classical harmony are the Joseph Haydn Harmonic Analysis Annotations Dataset, 3 the Annotated Beethoven Corpus, 4 the Beethoven Piano Sonata with Functional Harmony dataset, 5 and the TAVERN corpus of harmonically annotated theme-and-variation movements. 6 All these datasets have in common that they are bound to the oeuvres of particular (and prominent) composers.
In addition, there are several medium-sized datasets that allow scholars to examine harmony in Rock and Pop idioms. The largest among them is the McGill Billboard corpus 7 that consists of 743 transcriptions of popular music in the US between 1958 and 1991 and has been used, for instance, by Burgoyne et al. (2013) and Gauvin (2015). Similarly broad in scope is the Rolling Stone 200 Corpus, 8 which is based on Rolling Stone magazine's list of the "500 Greatest Songs of All Time" and has been analyzed by Temperley and de Clercq (2013). More specific Rock/Pop idioms have been studied using the Harte Beatles corpus. 9 A recent study by Moss et al. (2020) deals with harmony in 295 pieces in the Choro Songbook, providing the first quantitative style analysis of an idiom that emerged in Brazil at the end of the 19th century.
Apart from symbolic datasets of harmonic analyses, there are also numerous datasets of audio recordings that have been used for inferring harmonic characteristics. For instance, Zalkow et al. (2017) explore the notoriously complex tonal harmony in Richard Wagner's Ring cycle, relying on chroma features. Mauch et al. (2015) study the stylistic evolution of Pop music between 1960 and 2010, drawing on harmony as a prominent feature. Weiß et al. (2019) examine, among other things, the evolution of harmonic progressions on an even larger time-scale of several hundred years. Audio-based studies of harmony have the obvious advantage that they can, in principle, consider massive amounts of data since time-consuming human annotations play a smaller role compared to symbolic corpora. The flipside is, however, that they do not reach the level of detail and context-sensitivity that may be desirable from a music theoretical point of view.

Cadence Datasets
Within the domain of modal and tonal harmony, cadences act as salient structural patterns used to achieve closure on multiple hierarchical levels. Harmonically, these patterns are organized in different temporal phases: initial tonic ⟹ predominant ⟹ dominant ⟹ tonic. Each of these stages can be realized by selecting one or more chords fulfilling the proper functional criteria. Apart from the harmonic content, the closural effect of cadences crucially depends on further criteria, in particular structural voice-leading patterns (bass and soprano), the (hyper)metrical placement of a cadence, and its positioning in the form.
There is a growing number of theoretical and computational studies focusing specifically on the use of cadences in the "classical" repertoire (e.g., Caplin, 2004; Duane, 2019; Ito, 2014; van Kranenburg and Karsdorp, 2014; Neuwirth and Bergé, 2015; Sears, 2017a; Sears et al., 2018). However, the automated detection of cadences in music scores (Bigo et al., 2018) is not yet feasible with sufficient accuracy, mainly for two reasons.
First, there is a lack of formally concise definitions of cadence. Cadences are seen to emerge from coordinated activities of harmony, voice-leading, rhythm, and meter that are difficult to disentangle and have therefore been accounted for from a schema-theoretical perspective (e.g., Temperley, 2004; Gjerdingen, 2007).
Second, there are as yet only a few (mainly small) datasets available that contain cadence labels (e.g., Ito, 2014; van Kranenburg and Karsdorp, 2014; Duane, 2019; Sears, 2017a). For instance, Sears et al. (2018) provide and explore a comparatively small dataset of 270 cadence tokens in 50 selected string quartet expositions (1771-1803) from Joseph Haydn's oeuvre. As a result, analysts can examine the use of cadences only within one particular formal context, namely sonata form.
Similarly, Duane and Jakubowski (2018) confine themselves to exploring cadences in first-movement string quartet expositions (apart from Haydn also by Mozart and Beethoven). Two annotators were involved in creating this dataset; the intersection of their annotations was used for learning cadential categories in both supervised and unsupervised contexts (based on scale-degree distributions).
Note that none of these cadence datasets are accompanied by explicit and exhaustive harmonic analyses. Both issues, size and analytical richness, are tackled by the cadence dataset introduced in the present paper, which is not only much larger than previous datasets but is also supplemented by detailed harmonic annotations. As a result, it constitutes a valuable resource for investigating the complex interplay of structural components and has the potential of advancing our understanding of the harmonic nature of cadences. Further, it invites systematic comparison with the above-mentioned datasets.

Digital Scores
The dataset comprises the scores of the 54 sonata movements according to the Neue Mozart-Ausgabe (NMA) (Plath and Rehm, 1986) in the uncompressed XML format of the open-source notation software MuseScore 3, thus providing an alternative to Craig Sapp's **kern edition of the Alte Mozart-Ausgabe (AMA). 10 Compared to the AMA, the NMA introduces the additional sonata K. 533/494 and reflects modern critical edition practices. The MuseScore format combines the advantages of free and easy-to-use software, i.e. consistent typesetting of the scores across platforms and systems, with those of a dialect-free XML encoding that affords robust parsing and text-based version control.
As a starting point, existing encodings of the sonatas were collected and converted from various sources and file formats. Since conversion between formats tends to be lossy, we relied, where possible, on existing transcripts in MuseScore format by Lucas Mossman 11 and in Capella format by Tobias Schölkopf 12 (which can also be processed with MuseScore). For the remaining sonatas we converted Craig Sapp's digital edition to MusicXML and then to MuseScore format. The only missing sonata, K. 533/494, was typeset by Tom Schreyer specially for this edition.
The converted files were corrected by the professional transcription company tunescribers.com to make them conform to the Neue Mozart-Ausgabe with respect to pitch, rhythm, articulation, dynamics, and bar numbering. Thus, the scores' content conforms in many respects to a modern authoritative edition.

Contextual Information and Granularity
The harmonic analyses included in this dataset encode expert knowledge of professional music theorists in a string-based format following a predefined syntax (see Subsection 3.4). This syntax has been designed such that it is as close as possible to the conventional Roman numeral notation used in many theory textbooks, while being applicable to a broad variety of musical styles (e.g. Baroque, Romanticism, and Jazz) and allowing for a high level of detail to be encoded (e.g. nonchord tones; see Subsection 3.4.2). It is self-evident, however, that a larger vocabulary of chord symbols entails higher analytical contingency (the number of technically correct alternative interpretations) and therefore increases the potential for inter-annotator disagreement. In the remainder of this section, we will make a case for the possibility of encoding a large variety of harmonic interpretations. The problem of analytical consistency is addressed in Section 4.
Harmonic analysis as taught through prominent textbooks (e.g., Kostka et al., 2013; Laitz, 2015; Clendinning and Marvin, 2016; Aldwell et al., 2011) does not involve merely labelling vertical sonorities in relation to roots and keys; rather, it is heavily informed by recognition of horizontal structures (e.g., suspensions, neighboring motions, sequential patterns, and voice-leading schemata). Although Roman numerals represent vertical entities (chords), they offer the possibility to encode, to a certain extent, horizontal aspects and hence the chord's context, for example by taking into account suspensions, neighbors, or pedal points.
As an example of the horizontal (contextual) aspects that are frequently encoded in Roman numeral analyses, take the excerpt in Figure 1. In mm. 19-24 it was decided to account for the melodic context (a horizontally shifted upper voice) by viewing the first sonority of each bar as a suspension chord. For instance, the first sonority in m. 20 is interpreted as a chord with root vi (E) where the fifth is suspended by a sixth, rather than as the first inversion of a C-major triad, IV6. Bar 22, beat 1 shows a more intricate case: in this specific context, the fifth (A4), though nominally a chord tone, is better interpreted as a suspension (indicated by the arrow-like v) of the fourth suspension that is in turn part of the cadential 64 chord. Furthermore, the I[ part of the label in m. 25 marks the beginning of a pedal tone G which continues until the closing ] in m. 32 (not shown). Finally, the example shows that, depending on a researcher's needs, this rather fine-grained way of annotating can also be evaluated on a more coarse-grained level. For example, disregarding all suspensions (within curved parentheses) and grouping labels by their numerals would produce the underlying progression ii6 IV V7 vi ii6 V I; grouping labels by the bass notes they implicitly express would result in C D E C D G (or 4 5 6 4 5 1, when expressed as scale degrees of G major).
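As a rough illustration of this coarse-graining, the following Python sketch strips parenthesized chord-tone changes and merges immediate repetitions. The label list is hypothetical (it is not a transcription of Figure 1), and the regular expression covers only the parenthesis convention mentioned above, not the full DCML syntax.

```python
import re
from itertools import groupby

# Hypothetical labels in G major; parentheses hold suspensions and similar changes.
labels = ["ii6", "IV(9)", "IV", "V7(4)", "V7", "vi(6)", "vi", "ii6", "V(64)", "V", "I"]


def strip_changes(label):
    """Remove every parenthesized group from a label, e.g. 'V(64)' becomes 'V'."""
    return re.sub(r"\([^)]*\)", "", label)


def coarse_progression(labels):
    """Strip parenthesized changes, then merge immediately repeated labels."""
    stripped = [strip_changes(label) for label in labels]
    return [label for label, _ in groupby(stripped)]


print(" ".join(coarse_progression(labels)))
```

Merging only immediate repetitions keeps the order of the progression intact, so a harmony that returns later (such as the second ii6 here) is still counted as a separate event.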

Analytic Consistency
Any analytical system crucially depends on the underlying criteria. In the case of a historically grown and widespread system such as Roman numeral analysis, the criteria that annotators apply would likely depend on their musical training and may therefore differ. At present, the task of setting up a universally valid, formalized set of analytical criteria that would lead any domain expert to the same Roman numeral analysis regardless of their musical training and of the musical style at hand, is still out of reach. We therefore abandon the idea of one incontrovertible harmonic ground-truth analysis and instead opt for solutions that reflect a consensus between at least two experts under a shared set of guidelines. 13 The guidelines underline the importance of being consistent with one's own analytical decisions throughout a piece, while reviewers are required to ensure the analytical consistency with other annotated pieces and corpora (compare Section 4).

Encoding Temporal Positions
Every annotation label in this dataset is attached to a position in the corresponding score, whether it was created in MuseScore (harmony and phrases) or in an external file (cadences). Temporal positions encoded in XML differ, however, from how musicians and musicologists generally refer to them. In the first place, the positions need to be immediately comprehensible for humans who conventionally use measure numbers (MN) and beats, the latter depending on the given time signature. For a machine, however, MNs are not always sufficient: the same MN may be comprised of several <measure> nodes in an XML encoding, as is the case for divided measures (e.g. Figure 2a) or first and second endings (e.g. Figure 2b). In the present context, the issue of score addressability plays an important role for two reasons. First, we want to present the various facets of the dataset in a uniform way that allows for correctly joining and interrelating them (for an example, see Subsection 5.2.1). Second, we want to enable users of our dataset to automatically remove and add sets of annotations from and to MuseScore files (e.g., inserting the cadence labels that in the first place were not contained in the scores themselves). 14 Both cases require a temporal encoding that unequivocally references positions in the score's XML structure. Therefore, we express every position as a tuple (MC, onset) where MC (measure count) represents a running count of <measure> nodes (independent of length and time signature) which always starts at 1 (this corresponds to the bar counts displayed in MuseScore's status bar). Consequently, the onset part of a position is given as distance from the MC's first event, measured in fractions of a whole note. In other words, the three A's at the beginning of Figure 2a's MN 64 have onset 0, as do the A4 on beat 2 and the grace note B4; beat 2 of MN 65 has onset 1/4. 
(MC, onset) tuples unambiguously reference positions within XML-encoded scores and can be easily converted to human-readable positions (MN, beat) which depend on time signatures and conventions. In the case of the cadence annotations, which were created using the latter convention (see Subsection 3.3), the beats were converted into fractions of a whole note, which allowed the ms3 parsing library (Hentschel and Rohrmeier, 2020) to infer the (MC, onset) positions that are now stored with them.
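The conversion between onsets and beats can be sketched as follows. This is a minimal illustration, not the ms3 implementation: it assumes that one beat equals the denominator unit of the time signature (compound meters are often counted differently), and the parameter `mc_offset` is a hypothetical name for the distance between the start of a <measure> node and the start of its measure number, which is non-zero for the second part of a divided measure.

```python
from fractions import Fraction


def beat_size(time_signature):
    """Length of one beat in whole notes, assuming the denominator unit is the beat."""
    _, denominator = (int(x) for x in time_signature.split("/"))
    return Fraction(1, denominator)


def onset_to_beat(onset, time_signature, mc_offset=Fraction(0)):
    """Convert an onset (fraction of a whole note within an MC) to a 1-based beat."""
    return (Fraction(onset) + mc_offset) / beat_size(time_signature) + 1


def beat_to_onset(beat, time_signature, mc_offset=Fraction(0)):
    """Inverse conversion: 1-based beat within an MN to the onset within the MC."""
    return (Fraction(beat) - 1) * beat_size(time_signature) - mc_offset


# Onset 1/4 in a measure with quarter-note beats falls on beat 2.
print(onset_to_beat(Fraction(1, 4), "3/4"))
# Onset 0 of a <measure> node that starts a quarter note into its MN is also beat 2.
print(onset_to_beat(0, "3/4", mc_offset=Fraction(1, 4)))
```

Using exact fractions rather than floats avoids rounding problems with tuplets, whose onsets cannot be written as finite decimals.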
Note that by considering the time intervals between positions, any set of positions in a given score also represents a segmentation of it. In that sense, the current dataset also provides various score segmentations, depending on which musical features researchers may want to include in a set:
• key regions (derived from harmony annotations; for an example, see Figure 3)
• segments between cadences (cadence annotations)
• phrases (phrase annotations)
• harmonies (harmony annotations)
• rhythmic layers (particular selection of note onsets)

Annotating Cadences
The cadence annotations included in this dataset were created in a tabular format, as tab-separated values (TSV). Using a simple text editor, the annotator encoded each label jointly with the corresponding temporal positions. Note that the cadence labels were prepared by the second author independently of, and without considering, the harmonic annotations described in Subsection 3.4. Informal harmonic analysis was but one component guiding the annotation of cadences, accompanied by consideration of melodic, contrapuntal, and (hyper)metric information.
Taking these structural dimensions into account, the cadence analyses adopt a typology that is based on recent music theoretical work (e.g., Caplin, 2004; Neuwirth and Bergé, 2015), operating with five labels. Two main cadence types are distinguished: authentic (perfect and imperfect, i.e., PAC and IAC) and half cadences (HC). The two core strategies for avoiding cadential closure have been labelled as deceptive and evaded cadences (i.e., DC and EC).
Note that the typology used here is tailored to the classical style and hence differs somewhat from those proposed in prominent textbooks (e.g., Kostka et al., 2013; Laitz, 2015; Clendinning and Marvin, 2016; Aldwell et al., 2011). Most importantly, we do not assume plagal and contrapuntal cadences to be genuine cadence types. For more details on this typology, the reader may wish to consult the README file.

Harmony and Phrase Annotations
Using notation software (such as MuseScore) currently provides the most comfortable way of creating, displaying, and manipulating annotations within a single framework, dispensing with a tedious and error-prone manual encoding of the label positions within a score. MuseScore's chord symbol functionality, for example, allows music theorists to quickly navigate and annotate even if they are not particularly computer-savvy (see the example in Figure 1). Using this functionality, three music theorists (Uli Kneisel, 42 movements; Tal Soker, 8 movements; and Adrian Nagel, 4 movements) created the harmonic annotations in this dataset, following the syntax and annotation guidelines developed at the Digital and Cognitive Musicology Lab (DCML). This syntax can be automatically validated (see Section 4) and encodes a whole range of key aspects of the Roman numeral chord nomenclature, which is the de facto standard for the harmonic analysis of Western tonal music. The data verification was performed directly in MuseScore, and the most recent version of the harmony labels is included in the MuseScore files.
The DCML harmonic annotation standard 15 consists of a plain text syntax that allows for a highly detailed encoding of harmonic interpretations. The harmonic analyses in our dataset provide information on properties of keys, the relation of chords to a given key, chord types, chordal roots, chord tones, non-chord tones, harmonic motion over pedal points, and musical phrases. The features that the standard encodes are listed in Table 1 and explained in the remainder of this subsection. Their combinations follow a predefined syntax that can be parsed using a regular expression.

Encoding Tonal Hierarchy
The way in which chord features are expressed in the DCML harmony annotation standard largely follows music theoretical conventions. One of these conventions is the analysis of chords in terms of a tonal hierarchy. From bottom to top, every chord is expressed relative to a local key, and a local key is in turn expressed relative to a global key in terms of Roman numerals. The resulting local key segments are visualized in Figure 3 as blue lines. In addition, this chart shows the next lower level of the tonal hierarchy, namely the one introduced by applied chords (in red, often subsumed under the term chord borrowing), such as secondary dominants. Direct adjacency of the local tonic that the label applies to is shown in green.
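To illustrate how such relative key labels can be resolved into absolute keys, here is a minimal sketch based on line-of-fifths arithmetic. It rests on several assumptions that are not spelled out above: keys are written as note names with uppercase for major and lowercase for minor, and in a minor reference key the numerals III, VI, and VII are read as natural-minor degrees. All function and table names are illustrative.

```python
import re

LETTERS = "FCGDAEB"  # one turn of the line of fifths: F = -1, C = 0, G = 1, ...

# Distance of each scale degree from the tonic on the line of fifths.
MAJOR_DEGREES = {"I": 0, "II": 2, "III": 4, "IV": -1, "V": 1, "VI": 3, "VII": 5}
# Assumption: in minor keys, III, VI and VII refer to the natural-minor degrees.
MINOR_DEGREES = {"I": 0, "II": 2, "III": -3, "IV": -1, "V": 1, "VI": -4, "VII": -2}


def name_to_fifths(name):
    """Map a note name such as 'G', 'Bb' or 'f#' to its line-of-fifths position."""
    letter, accidentals = name[0].upper(), name[1:]
    return LETTERS.index(letter) - 1 + 7 * (accidentals.count("#") - accidentals.count("b"))


def fifths_to_name(fifths):
    """Inverse of name_to_fifths, returning an uppercase note name."""
    turns, index = divmod(fifths + 1, 7)
    return LETTERS[index] + ("#" * turns if turns > 0 else "b" * -turns)


def resolve_key(numeral, reference_key):
    """Resolve a Roman numeral (e.g. 'V', 'vi', 'bII') relative to a reference key."""
    match = re.fullmatch(r"([#b]*)([IViv]+)", numeral)
    if match is None or match.group(2).upper() not in MAJOR_DEGREES:
        raise ValueError(f"not a supported numeral: {numeral!r}")
    accidentals, degree = match.groups()
    table = MINOR_DEGREES if reference_key[0].islower() else MAJOR_DEGREES
    fifths = (name_to_fifths(reference_key) + table[degree.upper()]
              + 7 * (accidentals.count("#") - accidentals.count("b")))
    name = fifths_to_name(fifths)
    return name.lower() if degree.islower() else name


local_key = resolve_key("vi", "G")         # local key vi in the global key G major
applied_to = resolve_key("V", local_key)   # a chord applied to V within that local key
print(local_key, applied_to)
```

Because each level is expressed relative to the one above it, the same function can be chained from the global key down to the key that an applied chord refers to.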

Encoding Chord and Non-Chord Tones
Drawing on a slightly modified Roman-numeral nomenclature, the harmony labels encode chord tones, especially the root, as well as its exact chord type (e.g., diminished triad, major seventh chord) and inversion. Moreover, the DCML standard offers the possibility of annotating non-chord tones such as suspensions, additions, and

Phrase Annotations
The DCML harmony annotation standard can be enriched with a very simple syntax for determining phrase boundaries. It uses the symbols { and } which can be attached to the end of any harmony label or else stand alone. { marks the beginning of a musical phrase and } its ending (e.g., Figure 1, mm. 24f.). The decoupling of beginnings and endings allows annotators to distinguish between (a) cases where two phrases are linked by a small transitory unit which is part of neither and (b) cases of phrase interlocking, where the endpoint of a phrase is also the beginning of the next, annotated as } { (see, for instance, Caplin, 2001). These annotations have been added to the MuseScore files by Adrian Nagel in a separate annotation step, relying on cadential and textural cues.
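A minimal sketch of how phrase spans could be recovered from these symbols is given below. The input format, a position-sorted list of (position, label) pairs, and the example labels are assumptions made for illustration; nested or unbalanced braces are not handled.

```python
from fractions import Fraction


def phrase_segments(labels):
    """Return (start, end) positions of phrases from position-sorted (position, label) pairs."""
    segments, start = [], None
    for position, label in labels:
        # Close the open phrase first, so that interlocking '}{' both ends and begins a phrase.
        if "}" in label and start is not None:
            segments.append((start, position))
            start = None
        if "{" in label:
            start = position
    return segments


# Hypothetical labels; positions are (MC, onset) tuples.
example = [
    ((1, Fraction(0)), "I{"),
    ((4, Fraction(1, 2)), "V}"),
    ((5, Fraction(0)), "{"),       # stand-alone phrase beginning after a transitory unit
    ((8, Fraction(0)), "I}{"),     # phrase interlocking
    ((12, Fraction(0)), "I}"),
]
for start, end in phrase_segments(example):
    print(start, end)
```

The gap between the first and second spans corresponds to case (a) above, a transitory unit belonging to neither phrase, while the shared position of the second and third spans corresponds to case (b).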

Data Validation and Verification
In this context, we use the word "valid" for data that is syntactically correct, be it valid XML in the case of scores, or valid strings according to the employed annotation standards and their syntax. The validity of scores is guaranteed by the fact that they can be opened and displayed with the current version of MuseScore 3 without throwing warnings and that they have been successfully parsed with Python's parsing library BeautifulSoup. The validity of cadence labels becomes evident by checking all labels with respect to the predefined vocabulary of the five cadence types. The harmony and phrase annotations have been validated using a regular expression. By "verification" we refer to the process of checking data for semantic correctness (i.e., music theoretical plausibility). In the case of the scores themselves, the main criterion of this process was the correspondence with the Neue Mozart-Ausgabe in terms of content (but not, for example, in page layout). As laid out in Subsection 3.1, this criterion has been ensured by professional typesetters.
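The vocabulary check for cadence labels amounts to a set-membership test, as the following sketch shows. The column names (`mn`, `beat`, `cadence`) and the sample rows are hypothetical and do not necessarily match the layout of the published TSV files.

```python
import csv
import io

CADENCE_TYPES = {"PAC", "IAC", "HC", "DC", "EC"}


def invalid_cadence_rows(tsv_text, column="cadence"):
    """Return (line number, label) for every label outside the five-type vocabulary."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Data rows start on line 2 because line 1 holds the column names.
    return [
        (line_number, row[column])
        for line_number, row in enumerate(reader, start=2)
        if row[column] not in CADENCE_TYPES
    ]


sample = "mn\tbeat\tcadence\n4\t1\tHC\n8\t1\tPAC\n12\t3\tPCA\n"
print(invalid_cadence_rows(sample))
```

A check of this kind only establishes validity in the sense defined above; whether a syntactically valid label is musically plausible is a matter of verification.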
When it comes to the annotations in our dataset, we rely on two criteria: (A) Every label has to represent a consensus between at least two theory experts as to which choice best satisfies the annotation guidelines, and (B) analytical decisions need to be consistent within at least one movement. The annotations were verified twice by the first two authors in exchange with the annotators, thus leading to a consensus between three experts. A schematic diagram of the process is shown in Figure 4. Each of the two reviewers checked the entire set of sonata movements. The suggested changes reflected either the correction of an objectively wrong label (e.g., in terms of the notes it represents), the suggestion of a different harmonic interpretation (both pertaining to criterion A), or the correction of an analytical inconsistency (criterion B). For every movement, the changes were then shared with the respective annotator who could either agree with or object to each suggestion. The latter case would lead to an exchange of arguments supporting or contradicting each of the two solutions, which would eventually result in a consensus on which label best reflects the discussed aspects, or in the decision to maintain both solutions as alternatives. The procedure is based on the idea of triangulation as a means of data verification (e.g., Flick, 2018) and ideally leads to a consistent, high-quality set of annotations (see the discussion in Subsection 6.2). Figure 4 reveals the procedure's similarity to the widespread git-flow branching model 16 and can indeed be put into practice using a remote Git repository.

Basic Statistical Properties
The dataset consists of expert annotations of all 18 piano sonatas by W.A. Mozart with three movements each. It contains roughly 104,500 notes distributed over 7,500 measures, 15,000 harmony labels (466 types), and 1,100 cadence labels (5 types). Figure 5 shows the distribution of pitch class counts over the whole corpus ordered on the line of fifths, which remarkably conforms to the shape of an almost perfect bell curve centered around G and D. This tallies with previous findings suggesting that the line of fifths is one of the prevailing topological structures underlying musical pitch space (Moss, 2019; Temperley, 2000).
Figure 4: Data triangulation scheme for verifying a set of expert annotations for a particular composition. Annotator and reviewer(s) share the goal of reaching a consensus on a set of annotation labels that best represents the structural properties of a composition given the predefined annotation principles (guidelines). Consensus is reached through discussions between annotator and one or several reviewers. Taking the common guidelines into account, annotators ensure analytic consistency within a composition by defending their own analytical choices, while reviewers base their suggestions and arguments on how these guidelines have previously been realized across datasets. (Diagram labels: Reviewer 1, Reviewer 2, Annotator, Consensus, Consensual Annotation Principles (Guidelines).)

Distinguishing between harmony labels occurring in major (12,700) and minor (2,300) key segments, Figure 6 plots the number of chord tokens for every chord type (blue markers), as well as the cumulative fraction of the current and all previous ranks with respect to all tokens (red markers), ordered by rank. The plots show that, out of the 466 different chord types (out of which 79 are shared between major and minor segments), relatively few account for a large portion of all tokens. Both in major and minor keys, the first five ranks are taken up by the labels I/i, I6/i6, V, V7, and V(64), which together account for 48.0% in major and 46.3% in minor, while 75.0% of all tokens are covered by the top 15 (major) and top 21 (minor) types. The decrease of label counts with increasing rank roughly shows the shape of a decaying power law, which has been found to be a consistent pattern for frequency distributions of chord labels, pitch class collections, and timbres in Western tonal music of the last three centuries (Zanette, 2006; Rohrmeier and Cross, 2008; Serrà et al., 2012; Moss, 2019), as well as for words in corpora of natural language and for many domains other than music and language (Mandelbrot, 1953; Piantadosi, 2014). Table 2 shows the distribution of cadence labels over the entire dataset and compares it to the one evaluated by Sears et al. (2018), which consists of cadence labels for 50 sonata-form expositions selected from Joseph Haydn's string quartets. The two distributions have a very high positive correlation (r(3) = .975, p = .0047). The fact that Mozart and Haydn were Austrian contemporaries, with their works being interrelated in multiple ways (e.g., Klauk and Kleinertz, 2016), invites further investigation into whether these cadence distributions are representative of (a) genre-bound forms only, (b) these two composers' entire oeuvres, (c) the compositional practice of a particular era in Vienna, or even (d) a larger historico-geographical space.
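The rank-frequency and cumulative-coverage figures reported above can be computed along the following lines. This is a sketch on a toy token list, not on the dataset itself; the function names are illustrative.

```python
from collections import Counter


def coverage_by_rank(tokens):
    """Return (rank, label, count, cumulative fraction) tuples, most frequent label first."""
    counts = Counter(tokens).most_common()
    total = sum(count for _, count in counts)
    rows, cumulative = [], 0
    for rank, (label, count) in enumerate(counts, start=1):
        cumulative += count
        rows.append((rank, label, count, cumulative / total))
    return rows


def types_needed(tokens, fraction):
    """Smallest number of top-ranked label types that covers the given fraction of tokens."""
    for rank, _, _, cumulative in coverage_by_rank(tokens):
        if cumulative >= fraction:
            return rank
    return None


toy_tokens = ["I"] * 5 + ["V"] * 3 + ["V7"] * 1 + ["ii6"] * 1
for row in coverage_by_rank(toy_tokens):
    print(row)
print(types_needed(toy_tokens, 0.75))
```

Applied separately to the labels of major and minor key segments, `types_needed(..., 0.75)` would yield the kind of figure quoted above (top 15 and top 21 types, respectively).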
The heatmaps in Figure 8 show the relative bigram frequencies for the top 25 label types in major on the left, and in minor on the right. Black bars indicate the normalized entropy of a given label's distribution. For example, the V2 has a rather low entropy because in the vast majority of cases (79.1% in major and 85.7% in minor), it proceeds to the same chord, I6 and i6 respectively.
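Bigram counts and the entropy of a label's successor distribution can be sketched as follows. The normalization is an assumption: the text does not state how the entropy was normalized, so this sketch divides by the logarithm of the vocabulary size, which is the maximum attainable entropy. The toy sequence is hypothetical.

```python
import math
from collections import Counter, defaultdict


def transition_counts(labels):
    """Count, for every label, which labels immediately follow it."""
    counts = defaultdict(Counter)
    for current, following in zip(labels, labels[1:]):
        counts[current][following] += 1
    return counts


def normalized_entropy(successors, vocabulary_size):
    """Shannon entropy of a successor distribution divided by log2(vocabulary_size)."""
    total = sum(successors.values())
    entropy = sum((c / total) * math.log2(total / c) for c in successors.values())
    return entropy / math.log2(vocabulary_size) if vocabulary_size > 1 else 0.0


sequence = ["V2", "I6", "V", "I", "V2", "I6", "V7", "I", "V", "I6"]
counts = transition_counts(sequence)
vocabulary_size = len(set(sequence))
for label in ("V2", "V"):
    print(label, dict(counts[label]), normalized_entropy(counts[label], vocabulary_size))
```

In this toy sequence V2 is always followed by I6 and therefore has entropy 0, mirroring the low entropy reported for V2 in the corpus.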
The chart in Figure 9 shows the distribution of the five cadence types over the 54 sonata movements. The quantitative prevalence of the PAC and HC shown there is also evident when considering the across-movement distribution of the labels. All movements contain at least two cadence types; eleven of them contain as a minimal prerequisite only PACs and HCs. In 16 movements, this core is complemented by one of the three remaining types (either EC, DC, or IAC) and only five movements make use of all cadence types (K. 281, II; 284, II; 309, III; 310, III; and 533, III).
The distribution of phrase lengths is shown in Figure 7. It displays peaks for time-spans that are 2, 3, 4, 6, and 8 whole notes in length, and no phrase is shorter than a 4/4 measure.

Terminal Harmonies of Cadences
In a first attempt to combine the cadence and harmony annotations contained in the present dataset, we address the basic question of what harmonies the various cadence types end on, and whether the finding conforms to what one would expect based on textbook knowledge (see Subsection 3.3). Since each cadence label marks the endpoint of a cadence, this task can be accomplished by joining the two sets of annotations together and looking up, for every cadence label, the corresponding harmony label. The results in Table 3 are largely unsurprising: perfect and imperfect authentic cadences invariably end on tonic chords (overwhelmingly major), half cadences on dominant chords (almost exclusively root-position and without a seventh dissonance), and deceptive cadences on chords that have scale degree 6 as a bass note (most frequently carrying the theoretically expected vi chord).
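The join described above can be sketched with a sorted lookup. The sketch assumes that the harmony in effect at a cadence position is the most recent harmony label at or before that position, and it uses small hypothetical annotation lists keyed by (MC, onset) tuples rather than the dataset's actual tables.

```python
from bisect import bisect_right
from collections import Counter, defaultdict
from fractions import Fraction as F


def harmony_at(harmonies, position):
    """Return the label of the last harmony starting at or before the given position."""
    positions = [pos for pos, _ in harmonies]
    index = bisect_right(positions, position) - 1
    return harmonies[index][1] if index >= 0 else None


def terminal_harmonies(cadences, harmonies):
    """Tabulate, per cadence type, the harmony labels found at the cadence positions."""
    table = defaultdict(Counter)
    for position, cadence_type in cadences:
        table[cadence_type][harmony_at(harmonies, position)] += 1
    return table


# Hypothetical annotations, both sorted by (MC, onset).
harmonies = [((1, F(0)), "I"), ((2, F(0)), "ii6"), ((2, F(1, 2)), "V7"),
             ((3, F(0)), "I"), ((4, F(0)), "V"), ((6, F(0)), "vi")]
cadences = [((3, F(0)), "PAC"), ((4, F(1, 4)), "HC"), ((6, F(0)), "DC")]

for cadence_type, counter in terminal_harmonies(cadences, harmonies).items():
    print(cadence_type, dict(counter))
```

Because (MC, onset) tuples sort chronologically within a score, an ordinary tuple comparison is enough for the lookup; with tabular data, an as-of merge on the same keys would serve the same purpose.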
Only the terminal chords of evaded cadences show a much greater variety than what would be expected (cf. Schmalfeldt, 1992): while 46.9% of ECs resemble authentic cadences in that they end on a root-position tonic, 23.5% do not end on a tonic chord at all (but on 12 alternative chords). The high proportion of first-inversion tonic endings (29.7%) in ECs is largely in keeping with prior theoretical expectations. Following up on this finding, it could be examined whether PACs on the one hand and IACs, HCs, and failed cadences (DC and EC) on the other differ primarily with respect to their endings (e.g., Caplin, 2004) or also with regard to their entire harmonic makeup. Related to that is the question of the extent to which cadence types can be predicted based on harmonic information alone, or conversely, whether cadence tokens may be harmonically similar across types (e.g., measured by the amount of shared harmonic vocabulary or information theoretical measures).
Since our dataset combines harmonic and cadence annotations, it can be used to determine potential similarities of cadence instances across the conventional types on the basis of multiple harmonic features, thus enabling scholars to scrutinize the results proposed by Sears (2017b).

Having shown in the previous section how two types of annotations (harmony and cadence) can profitably be combined, this section exemplifies how symbolic annotations can be put in relation to audio data, e.g., the harmonic density of a recording. To that purpose, we correlate the average harmonic densities (measured in labels per minute) of the sonata movements to their tempos (or musical densities, measured in quarter beats per minute). Both measures require actual performance durations for which we use aggregates. The underlying question is whether the speed at which harmony changes in a given piece correlates positively with its overall tempo. An alternative outcome could be that, on the contrary, the harmonic density remains within a certain range, covarying with different factors. Our second experiment started off with two main assumptions, namely that
• all measures within the same movement which have the same time signature have the same real-time duration and can be used to approximately calculate the tempo of a piece; 17
• every harmony label in this dataset represents a change of harmony in the music, and the harmonic density of a piece can be expressed by averaging its label count over its typical performance duration.
The median performance durations for the 54 sonata movements were retrieved from the Spotify API. To that end, six complete recordings were selected 18 because their filenames could automatically be matched to the sonata movements and because their durations showed no missing values. Then, the harmony labels had to be unfolded according to the repeat structures of the individual movements so that their counts would reflect the actual musical chronology. With this data, the harmonic density was calculated by averaging the resulting label count for every movement over its median performance duration. In order to approximate the tempo of every movement, we averaged the length of every score over the same performance durations. Considering the diverse time signatures and (unfolded) measure counts, which range from 40 bars (K. 332, II) to 550 bars (the Presto movement of K. 283), we opted for a uniform representation of the score lengths, expressed as the number of quarter notes that fit into one entire rendition (including repetitions), in order to normalize the different measure lengths represented by the various time signatures. Consequently, we call the unit of the computed tempos 'bpm' (beats per minute), although we are dealing with a constant beat size rather than with beats in the metrical sense.
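The two per-minute measures described above can be sketched as follows. This is an illustration under stated assumptions, not the script used for the experiment: the function names, the six durations, and the label and measure counts are all hypothetical, and every unfolded measure is assumed to be complete (no pickup measures).

```python
from statistics import median


def quarter_note_length(n_measures, numerator, denominator):
    """Length of an unfolded score in quarter notes (assumes complete measures only)."""
    return n_measures * numerator * 4 / denominator


def per_minute(count, durations_sec):
    """Average a count over the median performance duration, expressed in minutes."""
    return count / (median(durations_sec) / 60)


# Hypothetical movement in 3/4 with durations (in seconds) from six recordings.
durations_sec = [290, 295, 300, 300, 305, 310]
unfolded_labels = 250      # harmony labels after unfolding repeats (made up)
unfolded_measures = 200    # measures after unfolding repeats (made up)

density = per_minute(unfolded_labels, durations_sec)
tempo = per_minute(quarter_note_length(unfolded_measures, 3, 4), durations_sec)
print(f"harmonic density: {density} labels/min, tempo: {tempo} bpm")
```

Because both counts are divided by the same median duration, any error in that duration scales the two measures by the same factor. It therefore moves a movement along a line through the origin rather than changing the ratio of labels to quarter notes.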
The plot in Figure 10 suggests a strong correlation (r(52) = .80, p = 4.25e-13) between the two sets of values. Normalizing both label counts and beat counts by the same performance durations reveals a clear trend for faster movements to change harmony more quickly than slower movements. The cluster in the lower left part, consisting mainly of blue markers, contains 16 out of the 18 mostly slow middle movements of the corpus (the remaining two are minuets, marked Menuetto), which generally have shorter scores but equal or longer performance times than outer movements. But the cluster is also set apart vertically, which seems to suggest that slower harmonic changes are characteristic of the harmony of Adagio and Andante movements.
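For readers who wish to reproduce this kind of statistic, the following sketch computes a Pearson correlation coefficient, the least-squares slope, and the degrees of freedom (n - 2, which yields 52 for the 54 movements). The five value pairs are hypothetical, and regressing density on tempo is an assumption about the orientation of the fit.

```python
from math import sqrt


def pearson_and_slope(x, y):
    """Return Pearson's r, the least-squares slope of y on x, and the degrees of freedom."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / sqrt(sxx * syy)
    slope = sxy / sxx
    return r, slope, n - 2


# Hypothetical tempos (bpm) and harmonic densities (labels per minute).
tempos = [60, 80, 100, 120, 140]
densities = [30, 38, 52, 58, 70]

r, slope, dof = pearson_and_slope(tempos, densities)
print(f"r({dof}) = {r:.2f}, slope = {slope:.2f}")
```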
This approach invites a couple of improvements and ideas for future work. For example, instead of using only the durations from complete sets of recordings of the 18 sonatas, one could opt for a random sampling approach in order to evaluate more performances and thereby produce more robust statistics showing the typicality of a given duration. Also, some of the initial assumptions might have to be revisited. For example, the extreme outlier suggesting a tempo of 239 quarter notes per minute is due to the fact that for this particular piece (the first movement of K. 533/494) there seems to be a convention among pianists to repeat the first part of the piece, but not the second (as the score would suggest), which of course reduces the performance duration. Further investigations might also refine the idea of what makes for a change in harmony. As shown in Subsection 3.2.1, labels can be grouped into larger units, thus reducing the label counts. Furthermore, it might prove beneficial to apply linear mixed effects models in order to evaluate the effects of such a treatment. This would also be a useful approach for quantifying confounding factors such as human psychology which might lead annotators to analyze slow movements differently, or for shedding light on hidden patterns occurring in the disposition of harmonic densities between adjacent movements.
[Figure: axes "Tempo (normalized beats per minute)" and "Harmonic density (labels per minute)"; slope = 0.48]

Heterogeneous Data in a Unified and FAIR Format
The publication of this dataset adheres to the FAIR principles of Open Science, given that the data and associated metadata are
• findable, because they have been published online and entered into various data registries;
• accessible, because unrestricted access to the repository is granted through an Open Access publication with an attached digital object identifier (DOI);
• interoperable, because they use text-based formats (TSV and XML) exclusively and present their various facets in a unified tabular format;
• and reusable, since the accompanying Python script allows users to flexibly select, join, and transform parts of the data, and because the metadata have been enhanced with identifiers from Wikidata, 19 VIAF, 20 MusicBrainz, 21 and IMSLP. 22
One limitation of the dataset is that the scores come in a single format. Although MuseScore can be used to export the scores to MusicXML, such exports cannot be consistently imported, currently not even by MuseScore itself. In the future, it would be desirable to further increase reusability by publishing additional validated files in widely used formats, such as MusicXML or MEI. Without the availability of lossless conversion tools, however, this would immensely increase the curatorial effort. Nonetheless, providing score information in a tabular format that can be produced using a publicly available parser can be viewed as a valuable alternative.
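To illustrate what a unified tabular format makes possible, the sketch below joins two tab-separated tables using only the Python standard library. It does not reproduce the accompanying script. The column names (`movement`, `measure`, `label`, `cadence`) and the rows are hypothetical and may differ from the actual files in the repository.

```python
import csv
import io

# Hypothetical TSV excerpts; the real files may use other columns and values.
harmonies_tsv = "movement\tmeasure\tlabel\nK279-1\t1\tI\nK279-1\t2\tV7\nK279-1\t4\tI\n"
cadences_tsv = "movement\tmeasure\tcadence\nK279-1\t4\tPAC\n"


def read_tsv(text):
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def left_join(left_rows, right_rows, keys):
    """Attach the non-key columns of right_rows to matching left_rows (None if no match)."""
    index = {tuple(row[k] for k in keys): row for row in right_rows}
    extra = [col for col in right_rows[0] if col not in keys] if right_rows else []
    joined = []
    for row in left_rows:
        match = index.get(tuple(row[k] for k in keys), {})
        merged = dict(row)
        for col in extra:
            merged[col] = match.get(col)
        joined.append(merged)
    return joined


joined = left_join(read_tsv(harmonies_tsv), read_tsv(cadences_tsv), ["movement", "measure"])
for row in joined:
    print(row)
```

Joining on measure positions alone is a simplification. Annotations that fall within a measure would need a finer positional key.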

An Alternative Procedure for Verifying Expert Annotations
Since the advent of crowdsourcing platforms such as MTurk (Buhrmester et al., 2011), the quality assessment of subjective annotations has been a much-researched topic (e.g., Nguyen et al., 2016;Kutlu et al., 2020). Two persisting problems, however, are (1) the opacity of the analytical criteria employed by crowd annotators (especially when using disparate chord vocabularies) and (2) the question of how to assess the quality of annotation sets in which many labels do not coincide (for example in the case of diverging analytical granularities, see Subsection 3.2.1).
In Section 4, we therefore introduced an alternative way of ensuring the quality of annotations. It is based on the ideas of a standardized chord vocabulary, transparency and consistency of annotation principles across annotators and pieces, and achieving analytical consensus between two or more experts. The facts that both annotators and reviewers rely on the same (publicly available) annotation guidelines and that their names are known situate the annotated Mozart sonatas within the best practices of the Open Science philosophy (Vicente-Saez and Martinez-Fuentes, 2018).

Toward More Music-Theoretically Informed Annotations
Most existing chord annotation standards are tailored to the description of vertical sonorities, whether they rely on traditional Roman numeral analysis (e.g., Huron, 2020;Temperley and de Clercq, 2013;Cambouropoulos, 2016;White and Quinn, 2016;Chen and Su, 2018;Tymoczko et al., 2019) or on absolute ("guitar") chords (e.g., Burgoyne et al., 2011;Harte, 2010;Broze and Shanahan, 2013;Choi et al., 2016). With this publication we want to make a case for harmonic annotations that try to overcome some of the shortcomings of the solely vertical perspective by including voice-leading and other horizontal contexts such as suspensions, retardations, neighbouring motions, and organ points (see Subsection 3.2.1). The DCML harmonic annotation standard provides experts with a more expressive syntax, allowing them to consistently encode contrapuntal sequences, schemata, and other voice-leading techniques that heavily inform harmonic analysis. Similarly, the cadence typology used for our corpus depends on many more criteria than those explained above (see Subsection 3.3), including voice-leading, (hyper)meter, and form. Cadence annotators trained in the outlined "Caplinian" tradition likely apply these criteria implicitly in their analyses, but it requires the quantitative study of a large dataset to shed light on them empirically. Since harmonic progressions and voice-leading patterns can be realized in an intractable multitude of musical surfaces (for the case of cadences, see Rohrmeier and Neuwirth (2015)), it is difficult to define or enumerate annotation criteria and rules beforehand without falling back on ad-hoc principles. Large datasets of annotations reflecting the intricate decisions and intuitions of expert analysts therefore represent an important step toward the development of comprehensive formalized models of music and their application in the field of MIR.