Music Tempo Estimation: Are We Done Yet?

With the advent of deep learning, global tempo estimation accuracy has reached a new peak, which presents a great opportunity to evaluate our evaluation practices. In this article, we discuss presumed and actual applications, the pros and cons of commonly used metrics, and the suitability of popular datasets. To guide future research, we present results of a survey among domain experts that investigates today’s applications, their requirements, and the usefulness of currently employed metrics. To aid future evaluations, we present a public repository containing evaluation code as well as estimates by many different systems and different ground truths for popular datasets.


Introduction
The estimation of a music recording's global tempo is a classic Music Information Retrieval (MIR) task. It is often defined as estimating the frequency with which humans tap along to the beat (Scheirer, 1998; Dixon, 2001). In contrast to beat-tracking (Allen and Dannenberg, 1990; Goto and Muraoka, 1994) or local tempo estimation (Peeters, 2005), successful global tempo estimation requires the existence of a stable tempo as often occurs in Rock, Pop, or Dance music. To conduct a basic evaluation of a global tempo estimation system one needs the system itself, test recordings with globally stable tempo, suitable annotations, and at least one metric. Starting with the work of Goto and Muraoka (1994) and Scheirer (1998), the MIR research community has been conducting such evaluations for 25 years. Acknowledging the importance of making results comparable, the first systematic evaluation with a defined set of metrics and datasets was conducted in 2004 (Gouyon et al., 2006). One year later, the 2005 Music Information Retrieval Evaluation eXchange (MIREX) (Downie, 2008) established an automatic tempo extraction task, which has been conducted almost every year ever since. Through both the datasets and metrics established in 2004 and for MIREX, we have seen global tempo estimation systems improve and have been able to track their performance. In the meantime, new datasets have been published and another large-scale evaluation has been conducted (Zapata and Gómez, 2011), but neither applications nor metrics have been fundamentally questioned or updated. This is why recent near-perfect MIREX results (Böck et al., 2015; Schreiber and Müller, 2018b) beg the question: are we done yet?
In this work, we critically discuss the evaluation of global tempo estimation systems. We do so based on the idea that applications lead to use cases that define who the users are, how they use the system, in what context, and for what purpose. The combination of these elements determines the success criteria to evaluate systems and judge whether the task is indeed solved (Sturm, 2016). This kind of evaluation also allows us to acquire new knowledge and advance the field (Serra et al., 2013, p. 31), if experiments are followed by interpretation of results, learning, system improvement, and eventually re-evaluation or even re-definition of the task or the evaluation methodology (Figure 1). This is referred to as the research cycle (Sturm, 2016). For it to succeed, we need to be able to conduct analyses of all parts of the evaluation process: task definition, data, metrics, systems, and analysis. As has been pointed out before (Sturm, 2013a; Raffel et al., 2014; Salamon et al., 2014), this disqualifies evaluation campaigns with private or secret data and closed source evaluation code. Evaluation itself must follow the same cycle of learning. How we evaluate must be analyzed, questioned, and improved (Serra et al., 2013, p. 33). Do datasets and metrics match current use cases? Are there recordings for which no system estimates the correct tempo, or recordings most systems estimate different tempi for? Does that mean the annotation is wrong, the tempo is hard to estimate, or the recording is not suitable for the task? To become aware of and address these issues, we need versioned annotations and publicly archived estimates. A step in this direction was taken by Böck et al. (2019), by publishing annotations and estimates as supplemental material.
We start our investigation in Section 2 with a discussion of the relevance of tempo estimation in light of presumed and actual applications (in the general sense, not referring to a specific software program). In Section 3, we review popular metrics with emphasis on construct validity. Then, in Section 4, we present the results of a survey among domain experts, which aimed at finding out which applications are important to them and how they measure success. Based on these results, we propose the formal octave error as a complementary metric in Section 5. Then, in Section 6, we discuss size, quality, composition, and suitability of popular datasets. In Section 7, we propose a public repository for reference annotations, estimates, and evaluation code to help with future evaluations. Finally, in Section 8, we draw conclusions.
Throughout this article we will illustrate some observations with tempo estimates produced by three systems: perc (Percival and Tzanetakis, 2014), böck (Böck et al., 2015), 1 and schr (Schreiber and Müller, 2018b). They were chosen for illustrative purposes, their conceptual differences, and availability, not because they necessarily represent the state of the art.

Applications
Even though tempo estimation is a well established MIR task, the existing research rarely discusses in depth why tempo estimation is relevant and what the application requirements are. Dixon (2001) identifies four main application types for his work on tempo and beat extraction: performance analysis, perceptual modeling, audio content analysis for retrieval, and performance synchronization. Most applications described in later work fall into these four broad categories. Alonso et al. (2003) mention automatic rhythmic alignment of audio, indexing for retrieval, and synchronized computer graphics. Peeters (2007) explicitly adds automatic playlist generation, DJ applications like beat-mixing and looping, and further beat-synchronous analysis (e.g., cover song identification, Ellis and Poliner (2007)). Tzanetakis and Percival (2013) list applications such as music similarity and recommendation, semiautomatic audio editing, automatic accompaniment, polyphonic transcription, beat-synchronous audio effects, and computer assisted DJ systems. Böck et al. (2015) add to this the contribution tempo estimation can make to beat-tracking, such that beats are aligned to a previously estimated tempo. Elowsson and Friberg (2015) consider tempo annotations useful for automated mixing, e.g., for beat-synchronous delay and compressor release settings. Similarly, Font and Serra (2016) mention remixing and browsing as potential applications.

Research Justifications
In publications focused on new methods, most application descriptions serve a motivational purpose justifying the conducted research. In fact, even though some of the mentioned applications not only require tempo, but also phase information (e.g., beat-synchronous delay), they all stem from publications primarily (but not necessarily exclusively) about tempo estimation. To the best of our knowledge, no formal application survey for tempo estimation has ever been conducted. Therefore we simply do not know how relevant tempo estimation is for any of the mentioned applications and what requirements these applications have. Rephrased in terms of commercial engineering: for the past 25 years we have largely ignored the customer. As Salamon (2019) recently observed, 'There is a disconnect between MIR research and potential users of MIR technologies.' This is not to say that the MIR community has conducted the wrong kind of research. After all, it is the privilege of basic research to not require an immediate application, and prefacing each scientific project with a market study is not expedient. But as tempo estimation and MIR as a whole mature, one might want sound justifications as to why and for what research is conducted.

Presumed Applications
We would like to illustrate the issue with two presumed applications of tempo estimation: similarity and recommendation (Tzanetakis and Percival, 2013; Percival and Tzanetakis, 2014; Böck et al., 2015). By definition, two recordings with the same tempo are similar, at least in this respect. But since similarity has many facets, tempo cannot be the only feature used to predict it. It may not even be very important. In fact, in their introduction to music similarity Knees and Schedl (2016) briefly mention tempo, but do not deem it important enough to thoroughly discuss it. To quantify how important tempo estimation is for music similarity, we counted the number of MIREX submissions for the similarity task that used tempo as a feature. 2 Many submissions used low-level temporal or rhythmic features, but only 8 of 62 (13%) explicitly used a single beats per minute value. One team even removed tempo as a feature in a subsequent submission. 3
Music recommendation is another application mentioned when justifying tempo estimation research. But is tempo estimation really useful for recommendation? Content-based systems certainly can take advantage of tempo annotations (Vignoli and Pauws, 2005), but to the best of our knowledge this is not a common approach. Slaney (2011) points out that recommendation based on collaborative filtering usually outperforms content-based systems, if enough usage data is available. Only in cold-start scenarios (e.g., lack of usage data) does content-based recommendation play a noteworthy role. Schedl et al. (2018) report that if content-based recommendation is attempted, 'almost all existing approaches rely on a number of predefined audio features that have been used over and over again, including spectral features, MFCCs, and a great number of derivatives.' This certainly does not exclude tempo, but in their report on current challenges for music recommender research tempo is never mentioned.
Therefore, we conjecture that global tempo estimation is only of marginal importance for general similarity and recommendation. It may however still play a role when it comes to specific similarity or recommendation tasks, for example in the context of ballroom dances or physical exercise.

Actual Applications
On the positive side, there are plenty of existing applications that are very similar to those stated in the literature. Tempo estimation has been used in computational ethnomusicology (Cornelis et al., 2013). Life science researchers who study connections between exercise and music tempo (Waterhouse et al., 2010) and athletes who want to control the tempo of their workout naturally benefit from tempo estimation systems. Consumer applications like beaTunes (https://www.beatunes.com/) provide this information via offline analysis, and streaming services like Spotify (https://www.spotify.com/) or Deezer (https://www.deezer.com/) offer playlists with narrow BPM ranges made for runners. The music store BeatPort (https://www.beatport.com/) labels all its tracks with global BPM and key values to help DJs when shopping. And when performing, DJs can take advantage of tempo analysis and beat-tracking/matching features of their DJ software (e.g., Traktor, https://www.native-instruments.com/). Thus useful applications exist, even though they are typically not the result of user studies or other requirements gathering processes by the MIR community.

Metrics
The exemplary evaluation during the 2004 ISMIR conference effectively established the accuracy metrics ACC 1 and ACC 2 as standards. Few subsequent publications explicitly discuss the musical concept of global tempo. Instead, researchers seem to assume that measuring ACC 1 and ACC 2 is identical to measuring global tempo. De facto, the metrics have become the task definition (Salamon, 2019). The only popular alternative is the P-Score metric.

Accuracy 1 and 2
ACC 1 computes a 0 or 1 score per track, which indicates the correctness of an estimate, allowing a 4% tolerance. This tolerance is described as 'somewhat arbitrary' (Gouyon et al., 2006). It was not chosen because someone defined an application that required a certain precision, but because it was assumed that the test tracks have 'approximately constant tempi.' This may have been a good choice for traditionally produced music, but seems lenient for electronic music or music produced with modern production techniques like click tracks (Lamere, 2009), and strict for Romantic piano pieces. Attempting to justify the tolerance, Gouyon et al. (2006) argue that according to Friberg and Sundberg (1995) the Just-Noticeable Difference (JND) for music tempi is approximately 4% and therefore '4% is probably the highest precision level that should be considered.' Unfortunately, we see problems in this argument. First, Friberg and Sundberg's experiment measured whether participants were able to perceive the non-isochronous placement of the fourth tone in a sequence of six tones. But instead of 4%, they actually found an average JND of 2.5% for tracks with tempi between 60 and 250 BPM. Second, and more importantly, it is not conclusively explained how this experiment relates to determining the tempo of a 30 s sample, as was the task during the ISMIR 2004 contest. We therefore do not believe that the results of the experiment are suitable to derive the ACC 1 tolerance parameter. In fact, when plotting ACC 1 for the tempo estimation systems böck, schr, and perc with different tolerances (Figure 2), we see that all three systems are capable of estimating tempo for Ballroom (Gouyon et al., 2006) tracks with almost the same accuracy at 2% tolerance as they are at 4% tolerance. That said, for datasets with less stable tempi, 4% may be too strict.
This points to issues inherent to binary metrics. The threshold is usually arbitrary, because it cannot be derived in an indisputable, objective way. Furthermore, it hides information. ACC 1 does not tell us how wrong an estimate is, nor in which direction. This means that we cannot easily plot an error distribution or other descriptive statistics. ACC 1 is also blind to small systematic errors below the threshold. At the same time, it may overemphasize differences between systems. As an extreme example, systematic errors of +4.01% and +3.99% may not differ much, but their ACC 1 scores could not be further apart. Specifying the tolerance for ACC 1 in percent may also be questioned. Assuming a fictional tolerance of 50%, a recording may be estimated half as fast, but not twice as fast. Contrary to that, estimating a triple meter recording at half its tempo is arguably less appropriate than at twice its tempo (Elowsson and Friberg, 2015). ACC 2 additionally allows estimates to be wrong by the factors 2, 3, 1/2, or 1/3 (so-called octave errors). This metrical tolerance was not motivated by application requirements either, but by the realization that the used annotations may not match the perception of human listeners. Unfortunately, because the meter is not taken into account, ACC 2 counts some perceptually erroneous estimates as correct (Gouyon et al., 2006). Consequently, Elowsson and Friberg (2015) regard it as 'inappropriate.' Another limitation of ACC 2 is that it says nothing about a system's ability to help a user to distinguish between slow and fast tracks. This reduces this metric's usefulness for applications like playlist generation based on tempo continuity or when searching for slow music (Peeters and Flocon-Cholet, 2012). Gärtner (2013) states: 'From the perspective of the user of DJ software, it is absolutely mandatory that the tempo is annotated correctly. The so-called octave errors are unacceptable.'
This mismatch between metric and usefulness illustrates that the construct validity of ACC 2, i.e., the correlation between use case, success criteria, and the employed metric, is far from perfect for the mentioned use cases.
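The behavior of the two accuracy metrics can be made concrete with a minimal sketch. It assumes that the tolerance is measured relative to the reference tempo and that ACC 2 simply applies the ACC 1 test to the reference scaled by each allowed factor; the function names `acc1`, `acc2`, and `mean_score` are illustrative and not taken from any published evaluation code.

```python
def acc1(reference_bpm: float, estimate_bpm: float, tolerance: float = 0.04) -> int:
    """Return 1 if the estimate lies within +/- tolerance of the reference, else 0.

    Assumption: the tolerance is relative to the reference tempo.
    """
    return int(abs(estimate_bpm - reference_bpm) <= tolerance * reference_bpm)


def acc2(reference_bpm: float, estimate_bpm: float, tolerance: float = 0.04) -> int:
    """Like acc1, but also accept the reference scaled by 2, 3, 1/2, or 1/3."""
    factors = (1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0)
    return int(any(acc1(reference_bpm * f, estimate_bpm, tolerance) for f in factors))


def mean_score(metric, references, estimates, tolerance: float = 0.04) -> float:
    """Average a per-track 0/1 metric over a dataset."""
    scores = [metric(r, e, tolerance) for r, e in zip(references, estimates)]
    return sum(scores) / len(scores)


# Two nearly identical systematic errors end up on opposite sides of the threshold.
just_inside = acc1(100.0, 103.99)
just_outside = acc1(100.0, 104.01)

# A half-tempo estimate fails ACC 1 but passes ACC 2.
half_tempo_acc1 = acc1(120.0, 60.0)
half_tempo_acc2 = acc2(120.0, 60.0)
```

Passing a different `tolerance` to `mean_score` is enough to reproduce the kind of tolerance sweep discussed for Figure 2, although the sketch makes no claim about how that figure was actually computed.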

P-Score
A metric that takes tempo ambiguity into account and treats it as an inherent property of music (Moelants and McKinney, 2004) is the P-Score proposed by Moelants and McKinney for the MIREX audio tempo extraction task in 2005. 4 The original metric incorporated two metrical levels as well as a phase estimate, and considered an estimation system's salience estimation. In 2006 it was simplified to:
$$P = \mathrm{ST1} \cdot \mathrm{TT1} + (1 - \mathrm{ST1}) \cdot \mathrm{TT2}$$
where each track is annotated with two reference tempi, T1 and T2, and T1's relative perceptual strength ST1 ∈ [0, 1]. T1, T2, and ST1 are the result of an expensive process involving many annotators per track. To calculate a P-Score, TT1 ∈ {0, 1} encodes the ability of an estimation system to identify T1 with a tolerance of 8% as a boolean value, 0 or 1. TT2 ∈ {0, 1} is defined correspondingly. 5 In addition to the P-Score, 'One Correct' and 'Both Correct' percentages are published for systems participating in MIREX. Because P-Score accounts for ambiguity in human perception and does not reward perceptually erroneous estimates, it is an improvement compared to ACC 2, but still has shortcomings. We were unable to find any formal justification for the used 8% tolerance. According to McKinney, 'the tolerance was derived empirically through the evaluation of a number of excerpts, algorithms and studies. It is somewhat arbitrary […].' 6 Furthermore, since 2006 the metric does not require an estimation system to assign a salience value to its two estimates per track. 7 This means that an application using a system with a perfect P-Score still has to guess which of the two estimates is the more salient one. Just like ACC 2, P-Score does not test the ability of a system to distinguish between slow and fast. It also is not efficient in the sense that it is relatively expensive to create the necessary ground truth.
This might explain why only one other suitable dataset (Schreiber and Müller, 2018a) has been created since the original MIREX dataset in 2005, which itself had been created for an experiment about the perception of tempo and not for MIREX.
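As a rough illustration of the simplified P-Score, the following sketch assumes that the score is the strength-weighted sum of the two boolean hits TT1 and TT2, that the 8% tolerance is relative to the respective reference tempo, and that a reference counts as identified if either of the system's two estimates falls within tolerance. The function names are hypothetical.

```python
def within_tolerance(reference_bpm: float, estimate_bpm: float,
                     tolerance: float = 0.08) -> bool:
    """True if the estimate deviates from the reference by at most the tolerance."""
    return abs(estimate_bpm - reference_bpm) <= tolerance * reference_bpm


def p_score(t1: float, t2: float, st1: float, estimates,
            tolerance: float = 0.08) -> float:
    """Simplified P-Score for one track.

    t1, t2:    the two annotated reference tempi
    st1:       relative perceptual strength of t1, in [0, 1]
    estimates: the system's tempo estimates (assumed: two values, unordered)
    """
    tt1 = int(any(within_tolerance(t1, e, tolerance) for e in estimates))
    tt2 = int(any(within_tolerance(t2, e, tolerance) for e in estimates))
    return st1 * tt1 + (1.0 - st1) * tt2


# Both estimate orders yield the same score, so a perfect P-Score does not
# reveal which of the two estimates the system considers more salient.
score_a = p_score(60.0, 120.0, 0.3, (60.0, 120.0))
score_b = p_score(60.0, 120.0, 0.3, (120.0, 60.0))
```

Because the estimates are treated as an unordered pair, the sketch also shows why the metric cannot reward a system for ranking its two estimates correctly.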

Survey
To better answer some of the questions raised regarding applications and metrics, we have conducted a survey among domain experts who work or have worked on tempo estimation. In this section, we highlight the most important results. Details with graphical depictions are shown in Appendix A and the raw data is available as supplemental material.
Of the 24 individuals who filled out the questionnaire, 17 (71%) belonged to academia and 7 (29%) to the industry. Most participants identified themselves as researchers (92%), and a majority claimed to be involved in hands-on algorithm implementation (71%). We were surprised to learn that, according to participants, the most important application is none of the usually mentioned ones, but producing 'input for other algorithms.' While 'other algorithms' may include 'recommendation' and 'similarity', neither of these two options was explicitly chosen by any participant. The second most important application is 'performance synchronization.' Participants from the industry tend to focus their tempo estimation efforts much more on particular genres than those from academia. To industry, the danceable genres Ballroom, EDM/Disco, Hip Hop/Rap, and Reggae are most important (in that order). Classical is only ranked fifth. Contrary to this, those members from academia who target specific genres ranked Classical first, followed by EDM/Disco, Pop/Rock, and Hip Hop/Rap. Ballroom and Folk were not ranked at all by academics. We speculate that this difference may be related to the respective group's motivation. Academia has already reached very good results for Ballroom music (Böck et al., 2015), which makes it uninteresting, while Classical music might still be seen as a challenge and may appear more interesting from a musicological point of view. In contrast, the industry is not primarily driven by interestingness, and typical industry applications, like DJ software, focus on dance, not classical music.
Being able to distinguish slow from fast tracks is very important for the applications of most participants. This appears to be a central requirement. A strong majority of industry applications (71%) also seem to need a single BPM value rather than a tempo distribution. In academia, this is only true for 57%. Both groups see ACC 1 as a very useful metric when it comes to measuring how well an application meets its requirements or how well a research objective is achieved. For ACC 2 the picture is less clear. While the industry leans toward 'useful,' members of academia gave answers covering almost the entire possible spectrum. This supports our criticism from Section 3.1 regarding the construct validity of ACC 2. When asked about the usefulness of P-Score, the two groups were of very different opinions. Most members of the industry tend to regard P-Score as 'not useful,' while many academics see it as 'essential.' This reveals a big divide between evaluation for industry applications and the scientific evaluation at MIREX.
The survey documents that many industry members are interested in more accurate tempo values than are tolerated by ACC 1 or ACC 2. Among them, the most often demanded accuracy was '2 decimal places.' This must be seen in the context of target genres, which for the industry are more oriented towards dance music, which typically has a very stable tempo. The most popular choice among academics was 'Other.' Here, free-form answers ranged from 'BPM with as small as possible tolerance' over 'no specific application yet' to 'depends on the dataset and the accuracy of the annotations.' The second most popular choices among academics were 'nearest integer' and '2% tolerance.' Regardless of affiliation, no one chose 8%, the tolerance traditionally used at MIREX. 8 Lastly, while a strong majority (73%) of all participants still regard global tempo estimation as a relevant MIR task, only 57% of industry members believe so. Some of the stated doubts are: 'tempo estimation is good enough for most industrial use cases', 'local tempo estimation is a much more useful task', and 'beat tracking, as a more general task than tempo estimation, solves all problems.'

Formal Octave Error
We have argued in Section 3 that the tolerances of ACC 1, ACC 2, and P-Score are difficult to justify and that the binary nature of these metrics hides information. Furthermore, using a percentage as threshold is sub-optimal, and the survey results indicate that there is interest in metrics with lower tolerance, up to an 'as small as possible tolerance.' We therefore propose a complementary metric that measures how close and in which direction an estimate is to a reference value. Inherently, such a metric supports meaningful visual depiction of error distributions. Gouyon et al. (2006) and Peeters (2007) have used such a metric, by showing the log2 of the ratio between estimates and reference values in histograms. Following them, we formally define the octave error OE 1 as
$$\mathrm{OE}_1(y, \hat{y}) = \log_2\left(\frac{\hat{y}}{y}\right)$$
with y, ŷ ∈ ℝ>0 as ground truth and estimate. OE 1 is designed to highlight the most important error class, octave errors, in an intuitive way. Errors by factors k and 1/k have the same magnitude, which means that in an OE 1 visualization the octave errors 2, 1/2, 3, and 1/3 are easily identifiable as clusters around 1, -1, 1.58, and -1.58. Figure 3a shows examples for OE 1 distributions for Ballroom rendered as violin plots. 9 Clearly visible is the concentration around -1 tempo octaves (TO) for all systems but böck, schreiber2017 (Schreiber and Müller, 2017), and schr. None of the systems suffer much from the relatively rare octave errors 3 or 1/3 (Peeters, 2007; Schreiber and Müller, 2017). The extent of the horizontal spread of the concentrations around 0 TO visualizes non-octave errors. OE 1 distributions can serve as indicators for the overall performance of a global tempo estimation system including the capability to help distinguish between slow and fast. Most importantly, one can see at a glance what kind of errors the tested systems are prone to.
[Figure 3: Based on values measured for Ballroom using a median ICBI-derived ground truth created from beat annotations by Krebs et al. (2013). Systems are ordered by year of publication (Scheirer, 1998; Klapuri et al., 2006; Davies et al., 2009; Oliveira et al., 2010; Gkiokas et al., 2012; Percival and Tzanetakis, 2014; Schreiber and Müller, 2014; Böck et al., 2015; Schreiber and Müller, 2017, 2018b). Estimates for zplane and echonest stem from Percival and Tzanetakis (2014).]
We have seen in our discussion of P-Score that taking tempo ambiguity into account is desirable, but that suitable datasets are rare and new datasets are expensive to create. Furthermore, P-Score has not been adopted by the industry (Section 4). For these pragmatic reasons, we do not attempt to solve the metrical level problem, but define OE 2 similar to ACC 2 as
$$\mathrm{OE}_2(y, \hat{y}) = \underset{x \in X}{\arg\min}\, |x| \quad \text{with} \quad X = \left\{ \mathrm{OE}_1(y, k\hat{y}) : k \in \left\{ \tfrac{1}{3}, \tfrac{1}{2}, 1, 2, 3 \right\} \right\}$$
OE 2 (Figure 3b) measures accuracy on a micro level, where the most common errors on the metrical level are ignored, i.e., it measures how close the estimate is to the nearest related tempo. 10 This is useful for genres with high tempo ambiguity, e.g., Dubstep (Schreiber and Müller, 2018a), and for applications that require errors to be as small as possible. The latter is a use case currently unsupported by ACC 1 and ACC 2, but desired by the industry (Section 4). While the mean of OE 1 or OE 2 indicates whether an algorithm is expected to over- or underestimate the tempo, the absolute octave error (AOE = |OE|) can be used for system comparisons. To illustrate, Figure 3c shows annotated AOE 1 distributions. Most older systems have an average AOE 1 between 0.3 and 0.4 TO, böck managed to halve this figure, and schr further reduced it to 0.056 TO. When ignoring octave errors by using AOE 2 (Figure 3d), we can see that böck and schr perform on a similar level.
Note that though the mean AOE is informative, we recommend also reporting a distribution for a more complete picture.
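The octave error metrics are simple enough to sketch in a few lines. The sketch assumes that OE 1 is the base-2 logarithm of the ratio of estimate to reference, and that OE 2 picks, among the estimate scaled by 1/3, 1/2, 1, 2, and 3, the OE 1 value with the smallest magnitude. The function names are illustrative only.

```python
import math
from statistics import mean


def oe1(reference_bpm: float, estimate_bpm: float) -> float:
    """Signed octave error in tempo octaves (TO): log2(estimate / reference)."""
    return math.log2(estimate_bpm / reference_bpm)


def oe2(reference_bpm: float, estimate_bpm: float) -> float:
    """Signed octave error to the nearest related tempo.

    Assumption: related tempi are reached by scaling the estimate
    by 1/3, 1/2, 1, 2, or 3; the candidate closest to zero wins.
    """
    factors = (1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0)
    candidates = [oe1(reference_bpm, estimate_bpm * k) for k in factors]
    return min(candidates, key=abs)


def mean_aoe(metric, references, estimates) -> float:
    """Mean absolute octave error (AOE = |OE|) over a dataset."""
    return mean(abs(metric(r, e)) for r, e in zip(references, estimates))


# A half-tempo estimate shows up at -1 TO in OE 1 and vanishes in OE 2.
half_tempo_oe1 = oe1(120.0, 60.0)
half_tempo_oe2 = oe2(120.0, 60.0)
```

Collecting the per-track `oe1` or `oe2` values, rather than only their mean, is what makes histograms or violin plots of the error distribution possible.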

Datasets
Evaluations of tempo estimation systems rely on datasets consisting of suitable recordings and annotations that model what we want to measure. Without claim to completeness, Table 1 lists popular tempo datasets. Unfortunately, some of these datasets are relatively small, focus on a particular genre, are not freely available (any more), or have other flaws like duplicates, mislabelings, and distortions (Sturm, 2013b, 2014; Salamon, 2019).

Dataset Size
To reliably measure differences between systems, a dataset must be sufficiently large to minimize the effect of random variation due to the sampling of tracks it contains. Generalizability Theory (GT) offers a statistical tool to estimate the required size for performance assessments in general (Cronbach et al., 1963; Brennan, 2003; Bodoff, 2008; Carterette et al., 2009; Salamon and Urbano, 2012; Bosch et al., 2016). Essentially, the GT framework decomposes the variability in the observed scores into variability due to actual differences between systems ($\sigma_s^2$), variability due to differences in track difficulty ($\sigma_t^2$), and residual variability ($\sigma_e^2$), which often refers to system-track interactions. The total variance of the observed scores is therefore modeled as:
$$\sigma^2 = \sigma_s^2 + \sigma_t^2 + \sigma_e^2$$
There are several coefficients in GT, but here we will report only the dependability index Φ ∈ [0, 1], which measures the ratio of system variance to itself plus error variance (Brennan, 2003):
$$\Phi = \frac{\sigma_s^2}{\sigma_s^2 + \dfrac{\sigma_t^2 + \sigma_e^2}{M}}$$
where M is the size of the dataset. A high Φ-value means that the dataset can reliably separate actual differences among systems from random variation due to sampling of tracks. Φ-values greater than 0.95 are generally considered high enough, but because this is rather arbitrary we focus more on qualitative comparisons among datasets and metrics. We estimated Φ through an Analysis of Variance (ANOVA) for the datasets ISMIR04 Songs, Hainsworth, GTzan, Ballroom, SMC, RWC (here, the union of RWC-C, RWC-G, RWC-J, RWC-P, and RWC-R), and GiantSteps Tempo (Figure 4, a-g) using scores from five different systems (Davies et al., 2009; Percival and Tzanetakis, 2014; Böck et al., 2015; Schreiber and Müller, 2017, 2018b), closely following the approach described in Salamon and Urbano (2012). Figure 4 shows Φ as a function of the number of songs M, which lets us determine how many songs would be necessary for a reliable evaluation.
The actual number of songs in the respective dataset is indicated by a vertical and the 0.95 reliability level by a horizontal dotted line. In other words, for a large enough dataset, Φ should pass through the upper left quadrant (colored in pale orange). Using this criterion, only ISMIR04 Songs, Ballroom, and GiantSteps Tempo are large enough to reliably differentiate system performance for the tested algorithms when using ACC 1 or ACC 2. In all cases but GiantSteps Tempo, both OE 1 and AOE 1 lead to similar or better Φ-values than ACC 1, i.e., we reach a greater reliability level for the given dataset. In fact, all seven tested datasets are large enough to reach the 0.95 threshold when using OE 1 as metric. For OE 2 and AOE 2 the picture is not quite as clear: for some datasets, like GTzan and Ballroom, they reach higher Φ-values than ACC 2, for others, like SMC, lower values.
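How Φ depends on the dataset size can be sketched for a fully crossed systems × tracks score matrix. The sketch assumes the textbook two-way ANOVA estimators for the variance components (one score per system and track, negative estimates clipped to zero); it is not necessarily the exact procedure used for Figure 4, and all function names are hypothetical.

```python
import math
from statistics import mean


def variance_components(scores):
    """Estimate (var_s, var_t, var_e) from scores[s][t], the score of
    system s on track t, via two-way ANOVA without replication."""
    n_s, n_t = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    sys_means = [mean(row) for row in scores]
    trk_means = [mean(scores[s][t] for s in range(n_s)) for t in range(n_t)]

    # Mean squares for systems, tracks, and the residual.
    ms_s = n_t * sum((m - grand) ** 2 for m in sys_means) / (n_s - 1)
    ms_t = n_s * sum((m - grand) ** 2 for m in trk_means) / (n_t - 1)
    ss_e = sum((scores[s][t] - sys_means[s] - trk_means[t] + grand) ** 2
               for s in range(n_s) for t in range(n_t))
    ms_e = ss_e / ((n_s - 1) * (n_t - 1))

    var_e = ms_e
    var_s = max((ms_s - ms_e) / n_t, 0.0)  # clip negative estimates
    var_t = max((ms_t - ms_e) / n_s, 0.0)
    return var_s, var_t, var_e


def phi(var_s, var_t, var_e, m):
    """Dependability index for a dataset of m tracks."""
    denominator = var_s + (var_t + var_e) / m
    return var_s / denominator if denominator > 0 else 0.0


def required_tracks(var_s, var_t, var_e, target=0.95):
    """Smallest m with phi >= target (assumes var_s > 0 and target < 1)."""
    return math.ceil(target / (1.0 - target) * (var_t + var_e) / var_s)


# Toy example: 3 systems scored with a 0/1 metric on 4 tracks.
toy_scores = [[1, 1, 0, 1],
              [0, 1, 0, 0],
              [1, 1, 1, 1]]
toy_components = variance_components(toy_scores)
toy_phi_curve = [phi(*toy_components, m) for m in (4, 10, 100)]
```

Evaluating `phi` over a range of `m` yields a curve of the kind shown in Figure 4, and `required_tracks` gives the point at which that curve crosses the chosen reliability level.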
In Figure 4h, we show an evaluation of the MIREX dataset (McKinney et al., 2007) based on the published MIREX 2018 results. 11 'One Correct' reaches Φ = 0.95, P-Score reaches Φ = 0.92, but 'Both Correct' only Φ = 0.67. This means that the MIREX dataset is close to being large enough for P-Score but certainly not for 'Both Correct.' Note that all reported Φ-values depend on the tested systems. Removing older, worse performing systems from the evaluation may actually lower the Φ-value.
Serra (2014) states that among other aspects quality is an important criterion when creating research corpora. The audio has to be of high quality and annotations have to be accurate. Ten years after Ballroom had been used for the first time, Percival and Tzanetakis (2014) investigated the accuracy of the annotations and corrected 32 (4.6%) of them. Corrections were also made to ACM Mirum (135, 9.6%) and GTzan (24, 2.4%). Interestingly, Percival and Tzanetakis emphasize the importance of using correct annotations, because testing systems on faulty data may lead researchers to optimize for these errors. This fear might be indicative of the state of MIR at the time. Machine learning was not ubiquitous yet and tuning hyperparameters using the test set was not perceived as quite the methodological faux-pas it is seen as now. But there are other good reasons to strive for quantifiable quality in test datasets: interpretability and comparability. If the quality of a test dataset is unknown, a metric like accuracy can at best be used to approximate the lower bound of a system's true performance. At worst it is simply useless. It is impossible to say whether any changes to the system can still increase performance. Additionally, it is impossible to compare results for different datasets in a meaningful way, if the dataset quality is unknown.
Schreiber and Müller (2018a), for example, noticed the fairly low ACC 2 performance of state-of-the-art tempo estimation systems on the original annotations of the GiantSteps Tempo dataset, and conducted a crowdsourced experiment to create a new ground truth. When comparing the performance of böck on the original annotations with its performance on the new annotations, ACC 1 jumps from 58.9% to 64.8% and ACC 2 from 86.4% to 94.0%.

Modeling Global Tempo
It is well known that some of the tracks in popular datasets have varying tempi (Hainsworth, 2004; Peeters, 2007; Percival and Tzanetakis, 2014). To address this issue, Hainsworth defined the tempo for the tracks in his dataset as the mean of the Inter-Beat Intervals (IBI). Percival and Tzanetakis (2014) suggested using the median instead, to counter the influence of outliers, an idea already used by Peeters (2007) and Oliveira et al. (2010). Böck et al. (2015) followed this suggestion, but to the best of our knowledge did not publish their annotations. Subsequent publications still used the original mean-based annotations (Schreiber and Müller, 2017) or tempo values obtained in some other way. For example, Elowsson (2016) derived tempi from the peaks of smoothed IBI histograms.
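The difference between mean-based and median-based tempo annotations is easy to demonstrate. The sketch below assumes beat annotations given as onset times in seconds and derives the tempo as 60 divided by the aggregated IBI; the beat sequence is a made-up example with one missed beat, and the function names are illustrative.

```python
from statistics import mean, median


def inter_beat_intervals(beat_times):
    """Differences between consecutive beat times, in seconds."""
    return [later - earlier for earlier, later in zip(beat_times, beat_times[1:])]


def tempo_from_ibis(ibis, aggregate=median):
    """Global tempo in BPM from IBIs, using the given aggregate (mean or median)."""
    return 60.0 / aggregate(ibis)


# Hypothetical annotation: steady 120 BPM, but one beat (at 2.0 s) is missing,
# which produces a single outlier IBI of 1.0 s.
beat_times = [0.0, 0.5, 1.0, 1.5, 2.5, 3.0, 3.5]
ibis = inter_beat_intervals(beat_times)

median_tempo = tempo_from_ibis(ibis, aggregate=median)  # robust to the outlier
mean_tempo = tempo_from_ibis(ibis, aggregate=mean)      # pulled down by the outlier
```

In this toy case the single outlier lowers the mean-based tempo by roughly 17 BPM, which is far outside a 4% tolerance, while the median-based value is unaffected.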
In addition to changing tempi, some datasets (Hainsworth, 2004; Marchand and Peeters, 2015) contain recordings with microtiming variations. One may argue that for such recordings neither the mean nor the median IBI is an ideal solution, because the beats are not necessarily isochronous. As a result, one may see multiple peaks in an IBI histogram. For example, the IBI-based BPM histogram for the GTzan recording jazz.00053 (Figure 5a) shows distinct peaks at 186, 190, and 201 BPM even though the tempo of the track does not change over time. Choosing the median of the IBIs (200.7 BPM) ignores the lower peaks at 186 and 190 BPM. If we know a track's meter, we may therefore prefer the median of the intervals between corresponding beats, i.e., the intervals between beats that occur at the same position in subsequent measures, divided by the number of beats per measure. Using this Inter-Corresponding-Beat Interval (ICBI) for tempo calculation, we can neutralize the effects of microtiming variations as well as outliers (Figure 5b).
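The difference between the median IBI and the median ICBI can be illustrated with a minimal sketch. The function names `ibi_tempo` and `icbi_tempo` and the synthetic beat pattern are illustrative assumptions, not code or data from the tempo_eval repository. The sketch assumes a constant, known meter and beat times given in seconds.

```python
from statistics import median


def ibi_tempo(beats):
    """Tempo in BPM derived from the median inter-beat interval (IBI)."""
    ibis = [later - earlier for earlier, later in zip(beats, beats[1:])]
    return 60.0 / median(ibis)


def icbi_tempo(beats, beats_per_measure):
    """Tempo in BPM derived from the median inter-corresponding-beat interval.

    Each interval spans one full measure (beat i to beat i + beats_per_measure)
    and is divided by the number of beats per measure.
    """
    m = beats_per_measure
    icbis = [(beats[i + m] - beats[i]) / m for i in range(len(beats) - m)]
    return 60.0 / median(icbis)


# Synthetic 4/4 track with microtiming: three short beats, one long beat.
pattern = [0.30, 0.30, 0.30, 0.34]
beats = [0.0]
for _ in range(8):
    for interval in pattern:
        beats.append(beats[-1] + interval)

print(round(ibi_tempo(beats), 1))      # median IBI ignores the long beat
print(round(icbi_tempo(beats, 4), 1))  # ICBI averages over the whole measure
```

In this synthetic case the median IBI reflects only the most frequent interval, while the ICBI reflects the duration of a whole measure, which is the effect described for jazz.00053 above.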

Dataset Suitability
While improving and versioning annotations is commendable, it does not ensure that the dataset fits the use case. Obviously, if the use case focuses on Ballroom, using a Reggae dataset for testing is the wrong approach. Similarly, if a metric is chosen that was designed for a certain use case, which may imply a certain kind of music, one must ensure that it is suitable for the kind of music actually used (Figure 6). As pointed out above, a precondition for using ACC 1 and ACC 2 with 4% tolerance is a stable tempo in each test track. We can visualize whether this precondition is met for a dataset by converting IBIs to normalized tempi and plotting their distribution. Concretely, given a track's IBIs b = {b_0, b_1, …, b_{N-1}} in seconds with b_n ∈ ℝ_{>0} and the […] (Figure 9a), all three systems reach higher scores at τ = 0.1 than for greater τ. Comparing ACC 2 for τ = ∞ to τ = 0.1, accuracy increases for böck by 18.4 pp, for schr by 11.5 pp, and for perc by 10.0 pp. For Hainsworth (Figure 9b) the systems also achieve higher scores at τ = 0.1, but the gain is smaller in absolute numbers. For GTzan (Figure 9c) the increase is smaller still, and for Ballroom (Figure 9d) there is none, because almost all tracks have small c_var(t). This relationship between τ and ACC 2 reveals that of the four datasets only Ballroom is suitable for ACC 2 (and thus ACC 1) without reservations, because it meets the required degree of stability.
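A stability check of this kind can be sketched as follows. The sketch rests on two assumptions that should be verified against the full definitions: that c_var(t) is the coefficient of variation (standard deviation divided by mean) of a track's local tempi, and that τ acts as an upper bound on c_var(t) when selecting tracks. The function names are hypothetical.

```python
from statistics import mean, pstdev


def normalized_tempi(ibis):
    """Convert IBIs (seconds) to local tempi (BPM), divided by their mean."""
    tempi = [60.0 / ibi for ibi in ibis]
    average = mean(tempi)
    return [tempo / average for tempo in tempi]


def c_var(ibis):
    """Assumed definition: coefficient of variation of the local tempi."""
    tempi = [60.0 / ibi for ibi in ibis]
    return pstdev(tempi) / mean(tempi)


def stable_subset(tracks, tau):
    """Keep only tracks whose assumed c_var is below the threshold tau."""
    return {name: ibis for name, ibis in tracks.items() if c_var(ibis) < tau}


# Hypothetical tracks: one perfectly steady, one with strongly varying IBIs.
tracks = {
    "steady": [0.5] * 16,
    "varying": [0.5, 0.4, 0.6, 0.5, 0.3, 0.7],
}

print(sorted(stable_subset(tracks, tau=0.1)))
print(sorted(stable_subset(tracks, tau=float("inf"))))
```

Evaluating a metric such as ACC 2 once on the full set (τ = ∞) and once on the stable subset (small τ) gives the kind of comparison discussed above.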

Public Repository
To help overcome issues like opaque one-figure evaluations with binary metrics, differently derived annotations, closed source evaluation code, and the inability to evaluate the evaluation, we have created a public GitHub repository called tempo_eval (https://tempoeval.github.io/tempo_eval/) that hosts different versions of corpus annotations (Section 7.1), estimates for these corpora (Section 7.2), and evaluation code that goes beyond single figure binary metrics. It provides a basis for the needed collaborative improvement of data and metrics. Section 7.3 demonstrates how the repository can be used for evaluation.

Reference Annotations
The tempo_eval repository allows the continuous improvement of reference annotations without shadowing past versions. This makes it possible to evaluate against all reference versions, improving comparability to older published results and thus transparency as well as interpretability. To provide easy access to reference data we converted published annotations to JAMS, for which tools already exist (Raffel et al., 2014).
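Because JAMS files are JSON documents, versioned tempo annotations can be inspected with very little code. The following is a minimal sketch using only the Python standard library. It assumes the common JAMS layout (a top-level `annotations` list whose entries carry a `namespace`, an `annotation_metadata` object with a `version` field, and a `data` list of observations with `value` and `confidence`). The helper name `read_tempi` and the example document are hypothetical, and a real project would more likely rely on the dedicated `jams` package.

```python
import json


def read_tempi(jams_text):
    """Return (version, [(bpm, confidence), ...]) for each tempo annotation."""
    jam = json.loads(jams_text)
    result = []
    for annotation in jam.get("annotations", []):
        if annotation.get("namespace") != "tempo":
            continue  # skip beats, tags, and other namespaces
        version = annotation.get("annotation_metadata", {}).get("version", "")
        tempi = [(obs["value"], obs.get("confidence"))
                 for obs in annotation["data"]]
        result.append((version, tempi))
    return result


# Hypothetical file content with two versions of a tempo annotation.
EXAMPLE = """
{
  "annotations": [
    {"namespace": "tempo",
     "annotation_metadata": {"version": "1.0"},
     "data": [{"time": 0.0, "duration": 30.0,
               "value": 120.0, "confidence": 1.0}]},
    {"namespace": "tempo",
     "annotation_metadata": {"version": "2.0"},
     "data": [{"time": 0.0, "duration": 30.0,
               "value": 60.0, "confidence": 0.3},
              {"time": 0.0, "duration": 30.0,
               "value": 120.0, "confidence": 0.7}]}
  ],
  "file_metadata": {"duration": 30.0},
  "sandbox": {}
}
"""

for version, tempi in read_tempi(EXAMPLE):
    print(version, tempi)
```

Keeping several annotation versions side by side in one file is what allows an evaluation to be repeated against each of them.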

Estimated Annotations
Rather than just serving as a static source of reference data, the tempo_eval repository offers a place for researchers to publish and archive their algorithms' estimates instead of just mentioning single value metrics in their publications. This allows re-evaluation with new and old reference annotations and proper development of new metrics, which may ultimately lead to a better understanding of tempo estimation systems and the tempo estimation task.
For example, Figure 3 shows values for a proposed metric (Section 5) for historic estimates measured against a ground truth, which has been newly derived from median ICBI-values. Because the repository is open and public, contributing is easy, e.g., via pull requests. As a starting point, we have added estimates by many recent and classic systems for commonly used datasets.

Evaluation Code
The tempo_eval repository also contains evaluation code. Implemented are ACC 1, ACC 2, and P-Score, along with McNemar's test for significant differences in ACC 1 and ACC 2, as well as OE 1, OE 2, their corresponding absolute variants, and t-tests for estimates from algorithm pairs. Results can be rendered in a publishable report (Markdown/HTML), and figures and data are exportable in several formats. As argued above, reporting single value metrics is not sufficient for an in-depth evaluation. We have therefore implemented visualizations for system performance depending on tolerances (Figure 2), tempo stability (Figure 9), tempo range (Figure 10), and, if available, genre or free-form tags (Figure 11).
As an example, we will discuss tempo- and genre-dependent evaluation using the Ballroom dataset with annotations from Percival and Tzanetakis (2014). Figure 10a shows ACC 1 values for subsets defined by tempo ranges [T - 10, T + 10] BPM. It is clearly visible that perc's ACC 1 drops to zero for T > 150 BPM and that böck's ACC 1 sharply decreases to 27.3% or less for T > 190 BPM. Both systems seem to exhibit some form of octave bias (Schreiber and Müller, 2017), i.e., the ability to estimate the tempo appears tied to certain tempo ranges. None of the systems seem to do well for tracks with T < 66 BPM or T > 225 BPM, but as we can see in Figure 10b, the dataset contains only very few songs in these tempo ranges. Figure 10d combines error magnitude, error direction, and significance in a single graph. It shows the predictions and their 95% confidence interval of generalized additive models (GAMs) fitted on the respective systems' OE 1 results. A large confidence interval indicates tempo regions with few samples or large variability in performance. In Figure 10d this can be seen for less than 75 BPM (few tracks), around 150 BPM (performance starts to shift), and for more than 210 BPM (few tracks, low performance).
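The metrics and the tempo-range subsets discussed above can be sketched in a few lines. This is not the tempo_eval implementation. It assumes the common definitions: ACC 1 accepts estimates within 4% of the reference, ACC 2 additionally accepts the factors 2, 3, 1/2, and 1/3, OE 1 is the base-2 logarithm of the ratio of estimate to reference, and OE 2 is the octave error with the smallest magnitude over the same factors. All function names are illustrative.

```python
from math import log2

OCTAVE_FACTORS = (1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0)


def acc1(reference, estimate, tolerance=0.04):
    """True if the estimate lies within the tolerance of the reference."""
    return abs(estimate - reference) <= tolerance * reference


def acc2(reference, estimate, tolerance=0.04):
    """Like acc1, but also accepts common octave errors."""
    return any(acc1(reference * factor, estimate, tolerance)
               for factor in OCTAVE_FACTORS)


def oe1(reference, estimate):
    """Signed octave error: 0 is correct, +1 is double, -1 is half tempo."""
    return log2(estimate / reference)


def oe2(reference, estimate):
    """Octave error of smallest magnitude after allowing octave factors."""
    candidates = [oe1(reference, estimate * factor)
                  for factor in OCTAVE_FACTORS]
    return min(candidates, key=abs)


def acc1_by_tempo_range(references, estimates, center, half_width=10.0):
    """Mean ACC 1 for tracks whose reference lies in [center +/- half_width].

    Returns None if no track falls into the range.
    """
    pairs = [(ref, est) for ref, est in zip(references, estimates)
             if abs(ref - center) <= half_width]
    if not pairs:
        return None
    return sum(acc1(ref, est) for ref, est in pairs) / len(pairs)


print(acc1(100.0, 200.0), acc2(100.0, 200.0))
print(oe1(100.0, 200.0), oe2(100.0, 200.0))
```

Unlike the binary accuracies, OE 1 preserves both the direction and the magnitude of an error, which is what makes plots such as Figure 10d possible.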
Because JAMS supports additional annotations like genre, tags, and beat positions, these can be incorporated into the evaluation. For example, Figure 11a shows OE 1 distributions by genre. Mostly due to -1TO octave errors, perc does poorly on Jive, Quickstep, and Viennese Waltz, the three genres with the highest average tempo. böck faces the same issue with Quickstep. This is noteworthy, because Jive, Quickstep, and Viennese Waltz combined make up almost 30% of the Ballroom dataset, as shown in Figure 11b (light-blue bars).
Note that evaluation by ballroom genre is just an example. The code picks up on any JAMS annotation declared in the tag_open namespace.
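A tag-dependent evaluation of this kind amounts to grouping per-track errors by tag. The following minimal sketch assumes that each track has already been reduced to a tuple of tag, reference tempo, and estimated tempo. The function name `mean_oe1_by_tag` and the example values are hypothetical, and the mean is used only for brevity where Figure 11a shows full distributions.

```python
from collections import defaultdict
from math import log2
from statistics import mean


def oe1(reference, estimate):
    """Signed octave error, assumed to be log2(estimate / reference)."""
    return log2(estimate / reference)


def mean_oe1_by_tag(tracks):
    """Group OE 1 values by tag and return the mean per tag.

    tracks: iterable of (tag, reference_bpm, estimate_bpm) tuples.
    """
    groups = defaultdict(list)
    for tag, reference, estimate in tracks:
        groups[tag].append(oe1(reference, estimate))
    return {tag: mean(values) for tag, values in groups.items()}


# Hypothetical estimates: the fast genre is estimated at half tempo.
tracks = [
    ("Jive", 176.0, 88.0),
    ("Jive", 168.0, 84.0),
    ("Rumba", 100.0, 100.0),
]

print(mean_oe1_by_tag(tracks))
```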

Conclusions
In this article we asked whether the task of global tempo estimation is solved yet. To find out, we investigated what applications global tempo estimation is used for, discussed currently used metrics, analyzed popular datasets with emphasis on tempo stability and size, and presented the results of a survey among domain experts. We found that applications and use cases for global tempo estimation are somewhat ill-defined, that the binary nature of ACC 1 and ACC 2 is problematic and the metrics are not suitable for some use cases, that the construct validity of ACC 2 is questionable, that the industry has not adopted P-Score, and that some currently used datasets are too small or do not have a tempo that is stable enough for ACC 1 and ACC 2. Because of these issues, our answer to the opening question is no. This is not because estimation systems are not good enough (we do not really know whether that is the case), but because it is impossible to solve a task for which neither use cases with success criteria have been well motivated and properly defined, nor the suitability of metrics or datasets has been shown.
Going forward, we need to recognize that global tempo estimation is a task serving different possible applications, each with its own accuracy requirements. Performance synchronization may need as accurate a tempo estimate as possible, while a general musicological interest, playlist building, or some other downstream algorithm may only require a rough estimate or tempo markings like andante and allegro. Actually achievable accuracy depends on tempo stability, on how tempo is modeled, and on how annotations are derived. ACC 1 and ACC 2 with their fixed 4% tolerance do justice neither to different accuracy requirements nor to different levels of tempo stability. We therefore recommend using the complementary OE metrics, which do not suffer from this limitation and deliver meaningful results for music with different degrees of tempo stability. If reporting ACC 1 and ACC 2 is a necessity, one might also want to plot results for tolerance ranges (Figure 2). In accordance with the industry and despite its popularity among scholars, we see no practical use for P-Score until larger datasets with the required annotations become available.
Almost regardless of metric or use case, we recommend against using the Hainsworth or the combined RWC datasets. Even though an evaluation with OE 1 is technically possible, they are too small for metrics that allow easy summarization like ACC 1 or AOE 1. Because of its borderline size, we also do not recommend GTzan. Due to its large tempo instabilities and small size, the SMC dataset should probably only be used to evaluate low-accuracy use cases using OE 1 or AOE 1, if at all. Of the tested datasets, we endorse using ISMIR04 Songs, Ballroom, and GiantSteps Tempo, if appropriate for the use case. To ensure comparable evaluations, we suggest using open source code like mir_eval or tempo_eval. All estimates and used annotations should be published to improve the reproducibility of the evaluation. The tempo_eval repository is meant as a home for this. Since annotations often exist in different versions, we explicitly warn against comparisons with accuracy figures reported by others.