Dagstuhl ChoirSet: A Multitrack Dataset for MIR Research on Choral Singing

Choral singing is a central part of musical cultures across the world, yet many facets of this widespread form of polyphonic singing are still to be explored. Music information retrieval (MIR) research on choral singing benefits from multitrack recordings of the individual singing voices. However, there exist only few publicly available multitrack datasets on polyphonic singing. In this paper, we present Dagstuhl ChoirSet (DCS), a multitrack dataset of a cappella choral music designed to support MIR research on choral singing. The dataset includes recordings of an amateur vocal ensemble performing two choir pieces in full choir and quartet settings. The audio data was recorded during an MIR seminar at Schloss Dagstuhl using different close-up microphones to capture the individual singers’ voices. In this article, we give detailed insights into all stages of creating DCS: recording process, data preparation, generation of annotations as well as development of suitable interfaces for publicly accessing and reusing the data. Furthermore, we demonstrate the potential of the dataset for MIR research by discussing case studies on choral intonation assessment and multiple fundamental frequency (F0) estimation.


Introduction
Choral singing is one of the most widespread types of polyphonic singing (Sundberg, 1987). For instance, the European Choral Association 1 reports over 37 million amateur and professional choir singers on the European continent, while Chorus America 2 reports 54 million active singers in the U.S. The great interest in choral singing motivates the need for MIR technologies to support singers and conductors in their rehearsal practices  via mobile applications 3,4 and webbased interfaces. 5 Over the last years, there has been an increasing number of MIR techniques developed for analyzing polyphonic vocal music (Dai and Dixon, 2017;Cuesta et al., 2018;Devaney, 2011;Devaney and Ellis, 2008;Howard et al., 2013;Howard, 2007;Weiß et al., 2019) as well as for synthesizing expressive singing Blaauw and Bonada, 2017). Essential to the development of such techniques is the availability of suitable datasets and processing tools. In particular, multitrack recordings are of great value for evaluation purposes. However, due to high demands on recording equipment and infrastructure, there exist only few publicly available multitrack datasets on polyphonic vocal music.
The lack of suitable research data was one of the driving motivations to create Dagstuhl ChoirSet (DCS), a publicly available multitrack dataset of a cappella choral music for MIR research (cf. Figure 1). The audio data was recorded during a one-week research seminar on "Computational Methods for Melody and Voice Processing in Music Recordings"  at Schloss Dagstuhl. 6 For the recordings, we assembled a vocal ensemble of mostly amateur singers (all were participants of the Dagstuhl seminar) covering different SATB (Soprano, Alto, Tenor, and Bass) voice sections. After several rehearsals with a conductor, we recorded multiple takes of two choir pieces in a full choir setting and two quartet settings (Quartet A and Quartet B). Furthermore, we recorded some systematic exercises for practicing choral intonation. As one main feature of the dataset, individual singers were recorded using multiple close-up microphones, including larynx, headset, and dynamic microphones. Subsequent to recording and curating the recorded multitrack data, we annotated beat positions and generated time-aligned score representations for each of the music recordings. Furthermore, we automatically extracted F0-trajectories for all close-up microphone signals. The publicly available dataset is archived on Zenodo 7 and is accessible via an interactive web-based interface with score-following and playback functionality. 8 In order to facilitate reproducibility and further research using this dataset, we have created an open source Python toolbox with helper functions to load, parse, and process dataset files. 9 In summary, our annotated dataset has different musical and acoustical dimensions that open up a variety of research scenarios. Besides being a good basis for studying amateur choral singing, DCS constitutes a challenging scenario for various fundamental tasks in MIR such as automatic music transcription (Benetos et al., 2019), score-to-audio alignment (Thomas et al., 2012), and beat tracking (Zapata et al., 2014;Böck et al., 2019). Moreover, the close-up microphone signals as well as the available F0-trajectories and scores can serve as a baseline to research on (informed) source separation techniques (Cano et al., 2019(Cano et al., , 2014. Furthermore, it allows for comparisons between multiple choir/quartet performances, choir settings, and microphone types.
The remainder of this article is structured as follows. In Section 2, we give an overview on datasets related to our work. In Section 3, we describe DCS by providing details on the choir settings, selected pieces, technical setup of the recordings, and generated annotations. In Section 4, we explain the different interfaces to access and use the dataset. In Section 5, we demonstrate the relevance of this dataset for MIR research by conducting two case studies on choral intonation assessment and multiple F0-estimation using state-of-the-art algorithms. Finally, in Section 6, we summarize our contributions and experimental results.

Prior Work
There is an urgent need for datasets in the field of MIR: annotated data are crucial for training data-driven systems or evaluating methods developed to solve specific tasks. Over the last years, the availability of suitable datasets has triggered research on tasks such as melody extraction (e.g., MedleyDB (Bittner et al., 2014)), music style identification (e.g., Ballroom dataset (Gouyon et al., 2004)), and automatic chord recognition (e.g., Beatles dataset (Harte et al., 2005)).
The datasets closely related to DCS are presented in Table 1. Su et al. (2016) created a small dataset for research on choral music. It consists of five short excerpts of Western choral music, ranging from 18 to 40 seconds in length. The dataset contains stereo audio recordings and note event annotations, annotated by a professional pianist. Although small in size, this dataset is relevant for multiple-F0-estimation in complex scenarios where sources are similar, (e.g., voices of a choir), and where several sources produce the same notes (i.e., unisons).
Over the last years, there has been an increasing interest of the MIR community in analyzing world music (Serra, 2014;Panteli, 2018), including traditional singing (van Kranenburg et al., 2019). A conceptually similar dataset to DCS in terms of recording methodology and utilized microphones is a set of multitrack field recordings of threevoice Georgian vocal music (Scherbaum et al., 2019). The dataset includes 216 songs recorded with video cameras, portable stereo recorders as well as multiple close-up microphones attached to each of the singers. Furthermore, the Erkomaishvlili Dataset is a publicly available corpus based on historic tape recordings of three-voice traditional Georgian songs performed by the former master chanter Artem Erkomaishvili . The dataset includes digital sheet music, F0-and onset annotations of the three voices as well as annotations of the overdubbingbased recording structure.
In the context of Western polyphonic vocal music, we find very few multitrack datasets. Two examples are datasets from a commercial application that have been used by McLeod et al. (2017): the Barbershop Quartets 10 and the Bach Chorales. 11 Both datasets contain separate tracks for each of the four SATB singers and an additional track with a stereo mix. The Barbershop recordings comprise 22 songs with a total length of 42 minutes, whereas the Bach Chorales contain 26 recordings with a total length of 58 minutes. The audio recordings and the accompanying synchronized MIDI files are not freely available.
The Choral Singing Dataset (CSD) (Cuesta et al., 2018) is a publicly available dataset of Western polyphonic vocal music. 12 The CSD consists of multitrack recordings of three SATB choral pieces: Locus Iste by Anton Bruckner, Niño Dios d'Amor Herido by Francisco Guerrero, and El Rossinyol, a popular Catalan song, performed by a small choir of 16 singers. The four singers of each choir section were recorded simultaneously in the same room with individual handheld dynamic microphones. However, the different sections were recorded separately where a MIDI track served as reference. The recording length of the three songs is around seven minutes. Furthermore, the CSD includes synchronized MIDI files, note annotations per choir section, and F0-annotations. In summary, the CSD is most similar to our dataset in terms of musical aspects. Further similarities and differences of the CSD to our dataset are discussed in Section 3.3.

Dagstuhl ChoirSet
In this section, we describe all components of DCS. In Section 3.1, we give details on the choir settings as well as the recorded pieces and exercises. Then, we explain the recording setup of the multitrack recordings in Section 3.2 and discuss the different dimensions of DCS in Section 3.3. Subsequently, we elaborate on the manually created beat annotations in Section 3.4. Furthermore, we provide details on the time-aligned score representations in Section 3.5. Finally, we describe the automatically extracted F0-trajectories in Section 3.6.

Choir Settings and Musical Content
In total, 13 singers (Dagstuhl seminar participants) took part in the recording session. All singers have provided their consent to publish the recorded material for research purposes under a Creative Commons license. The Full Choir consisted of two sopranos, two altos, four tenors, and five basses. From the Full Choir, we selected two soloistic SATB quartets (Quartet A and Quartet B) with four different singers each. The singers had diverse musical backgrounds (from hobby musicians to such holding a music degree) as well as varying levels of experience in (choir) singing within different musical genres. These experiences ranged from singers who had never sung in a choir before to a professional singer with many years of training. Considering that the singers had not sung in this constellation before the Dagstuhl seminar and had only few rehearsals together (3 sessions of roughly 1 hour length), the recorded choir and quartets may be representative of an amateur choir level, with individual skills partly exceeding that level. Rehearsals and recorded performances were also conducted by a Dagstuhl seminar participant, who is a professional composer with solid experience in conducting semiprofessional choirs, orchestras, and big bands. We recorded two pieces as well as several intonation exercises with the full choir and the two quartets. The central piece of DCS is Anton Bruckner's Locus Iste (WAB 23) in Latin language. Figure 2 displays the first eleven measures of the piece's score obtained from the Choral Public Domain Library (CPDL). 13 This small choir piece of approximately three minutes' duration is musically interesting, containing several melodic and harmonic challenges such as chromatic parts and covering a large part of each voice's tessitura (S: B3-G5, A: G3-B4, T: C3-E4, B: F2-C4). Beyond that, the piece is part of the CSD (Cuesta et al., 2018) (see Section 2), thus allowing for interesting comparative studies across datasets. Furthermore, we selected the piece Tebe Poem by the Bulgarian composer Dobri Hristov. 14 Both pieces are written for SATB choirs in four parts. In addition to these two pieces, the dataset contains a set of vocal exercises of different difficulties and forms taken from the book Choral Intonation (Alldahl, 1990). The exercises include scales, long and stable notes, chords, cadences, and a variety of intonation exercises. The additional recordings are potentially interesting to study aspects of ensemble singing such as interval intonation, F0-agreement in unison singing, and intonation drift in a cappella performances.

Multitrack Recordings
During the recording session, which took place in a Dagstuhl seminar room, we recorded multiple takes of the different pieces and settings. An overview of the recorded material in DCS is presented in Table 2. The reported durations refer to the accumulated durations of all takes for a specific piece and setting (not counting multiple tracks per take). The different choir settings were recorded using multiple microphones. In order to record the overall performance, we used an ORTF stereo microphone (Schoeps MSTC 64 U) spaced ca. 3 m away from the singers. The recorded stereo microphone signal is referred to as STM signal in the following. Furthermore, we used several close-up microphones to record individual singers. The recording setup for one singer, which is illustrated in Figure 3, includes a handheld dynamic microphone (Sennheiser MD421 II), a headset microphone (DPA 4066F), and a larynx/throat microphone (Albrecht AE 38 S2a). In the following, we abbreviate the three microphone types as DYN, HSM, and LRX respectively. LRX microphones have shown to be beneficial for analyzing voices of individual singers in polyphonic vocal music (Scherbaum et al., 2015(Scherbaum et al., , 2018. Being attached to the skin at the human throat, LRX microphones nicely capture the pitch of the singing voice. Furthermore, compared to other conventional microphones such as DYN microphones, LRX microphones are robust to environmental noise, e.g., the voices of neighbouring singers. However, due to the missing contributions of the vocal tract, LRX signals primarily serve as analysis signals. To illustrate the microphone differences, magnitude spectrograms of LRX and DYN microphone signals for a tenor singer in a quartet setting are shown in Figure 4a. The shown excerpts correspond to the marked Locus Iste passage in Figure 2. It can be observed that the LRX signal is cleaner than the DYN signal. This becomes evident especially in Part II (middle part of the marked passage), where the solo bass voice leaks more strongly into the DYN signal than into the LRX signal of the tenor.
For our recordings, we had four DYN, three HSM and eight LRX microphones available. The complete setup as shown in Figure 3 could only be used for three singers-other singers were equipped with two, one, or no individual microphone(s). Note that we distributed the microphones such that at least one singer of each part was captured with one LRX and one DYN microphone. The microphone signals were recorded using one RME Fireface UFX audio interface, two 8-channel RME Micstasy A/D converters, and the Digital Audio Workstation (DAW) Logic Pro X running on an Apple MacBook Pro (see Figure 5). Furthermore, we created an additional reverb version of the stereo microphone signal using the ChromaVerb plug-in in Logic Pro X with a decay time of 2 seconds. After recording, all tracks were exported from the DAW and subsequently cut according to manually set cut points using the tool PySox (Bittner et al., 2016). PySox is an open source library that provides a Python interface to SoX (Sound exchange), 15 a command line tool for sound processing. The cut tracks are available in DCS as monophonic WAV files with a sampling rate of 22050 Hz.

Dagstuhl ChoirSet Dimensions
DCS offers different musical and acoustical dimensions, which are summarized in Table 3. We refer to the dimensions as Song, Setting, Take, Voice, and Microphone. The  The multiple dimensions of DCS make it unique when compared to related datasets such as the CSD (Cuesta et al., 2018). The main differences between the CSD and DCS lie in the Setting, Take, and Microphone dimensions. The CSD includes one singer setting, a single take per song and one microphone type. Furthermore, the CSD choir sections were recorded separately, while all singers were captured at the same time in DCS. The different recording setup in DCS enables studies on interactions between sections. However, as opposed to the Full Choir setting in DCS, the recorded choir in the CSD is larger and balanced in the   In order to account for the variety of different dimensions, we developed a filename convention for all audio and annotation files included in DCS. The general format of the filenames is the following (cf. .wav refers to the audio signal from the larynx microphone (LRX) of the second tenor (T2) in the Full Choir setting (FullChoir) during the second take (Take02) of Locus Iste (LI). Note that the files with microphone shortcut STM contain a mono mix of the left and right channel of the stereo microphone.

Manual Beat Annotations
The beat is a key unit of the temporal structure of music (Goto and Muraoka, 1997). As stated by Robertson (2012), when beat annotations are manually generated by tapping along to an audio signal, they reflect the ability of the annotator to produce the beats rather than their perception. In such cases, the produced beat annotations can be subsequently refined by iteratively listening and modifying them according to perceptual cues. Following this premise, we generated beat annotations for all STM signals of Locus Iste and Tebe Poem in a two-stage process: in the first stage, annotations were manually created by an annotator with some musical background. The annotation by tapping feature in Sonic Visualiser (Cannam et al., 2010b) was used for this task. Sonic Visualiser is an open source software for generating manual annotations of various kinds. In the second stage, annotations were reviewed and refined by a second, experienced annotator using the same software.
These beat annotations are provided as comma-separated value (CSV) files with two columns. The first column contains timestamps in seconds, whereas the second column contains beat and measure information provided as floating point numbers to three decimal places. The part in front of the decimal point encodes the measure number. The part after the decimal point indicates the beat position inside the measure. For example, in 4/4 time, each beat is represented as an increment of 1/4 = 0.250, and therefore the beat positions are given as 1.000, 1.250, 1.500, 1.750, 2.000, 2.250, 2.500….

Time-Aligned Score Representations
In order to obtain a musical reference for the different performances of Locus Iste and Tebe Poem, we aligned MIDI representations of the pieces to the STM signals using the beat annotations from Section 3.4. The MIDI files were obtained from the CPDL (see Section 3.1). For synchronization, we used the dynamic time warping pipeline from Ewert et al. (2009) andMüeller et al. (2004) that uses the beat annotations as anchor points for the alignment. In order to facilitate data parsing and processing, we converted the aligned MIDI files to CSV files using pretty_midi (Raffel and Ellis, 2014), a Python library for processing and converting MIDI files. For each STM signal, DCS contains one separate CSV file per section (as opposed to MIDI files that include all sections). Each CSV file contains three columns, which represent note onset in seconds, note offset in seconds, and MIDI pitch. The number of rows is equal to the number of notes in the piece.

Fundamental Frequency Trajectories
One of the most important cues for computational studies on choral singing and choral intonation are the F0-trajectories of the individual singers' voices (Cuesta et al., 2018;Dixon, 2017, 2019). However, annotating F0-trajectories from polyphonic mixtures is cumbersome and requires a lot of labor-intensive work. We exploit the multitrack nature of DCS to automatically compute the F0-trajectories of each singer from the close-up microphone signals using two state-of-the-art algorithms for monophonic F0-estimation: pYIN  and CREPE (Kim et al., 2018).
The pYIN annotations were obtained using the pYIN Vamp Plug-in 16 for Sonic Annotator (Cannam et al., 2010a). For pYIN, we used an FFT size of 2048 and a hop size of 221 samples, which corresponds to around 10 ms for a sampling rate of 22050 Hz. We used the algorithm in the smoothedpitchtrack mode, which uses a hidden Markov model (HMM) and Viterbi decoding to smooth the F0-estimates. In addition, we configured the plugin to output negative F0-values in frames that are estimated as unvoiced (outputunvoiced=2) as well as the probability of each frame to be voiced (output=voicedprob). For CREPE, we used the CREPE Python package 17 with the model capacity set to full, Viterbi smoothing activated, a default hop size of 10 ms, and a default input size of 1024 samples. Similar hop sizes were used with both methods for an easier comparison. The F0-trajectories are stored in CSV files with three columns. The first two columns contain the timestamps in seconds and the F0-values in Hz. In the case of pYIN, the third column contains the probabilities of the frames to be voiced. In the case of CREPE, the third column contains the confidence as provided by the algorithm. The confidence is a number between 0 and 1 that indicates the reliability of an F0-estimate.
In order to validate the automatically extracted F0-trajectories, we generated manual F0-annotations for all voices of two quartet recordings based on the LRX signals. The annotations were made by a sound engineer with over ten years' training on saxophone using the tool Tony (Mauch et al., 2015) and are included in DCS as CSV files. For evaluation, we use common evaluation metrics for melody extraction as detailed by Poliner et al. (2007); Salamon et al. (2014). The metrics Voicing Recall (VR) and Voicing False Alarm (VFA) measure the accuracy of the algorithm's voice activity estimation. The metrics Raw Pitch Accuracy (RPA) and Raw Chroma Accuracy (RCA) measure the proportion of frames for which the estimated F0-trajectory lies within 50 cents (half a semitone) of the reference (RCA ignores octave errors). Additionally, the Overall Accuracy (OA) is a combined metric that accounts for both voice activity and F0-accuracy. We use the open source toolbox mir_eval  to compute the evaluation metrics. In our experiments, we derive the voice activity for F0-trajectories extracted by CREPE by choosing a confidence threshold that maximizes the overall accuracy. The evaluation results averaged over the two recordings for pYIN and CREPE (8 LRX, 6 HSM, 8 DYN trajectories per algorithm) are given in Tables 4 and 5, respectively. The standard deviations are given in brackets. Both algorithms perform most accurately on the LRX signals (0.93 of overall accuracy), slightly less accurate on DYN signals and least accurate on HSM signals. This is expected, since the F0 of the voice is more dominant in LRX signals than in DYN or HSM signals (see Section 3.2). The overall performance of both algorithms is similar on LRX and DYN signals and deviates for HSM signals, where CREPE performs better than pYIN.
In the following, we further analyze the differences between the microphone signals. Figure 4 illustrates the F0-trajectories from a tenor singer extracted from LRX and DYN signals using CREPE. The CREPE confidence values are depicted in Figure 4b. For visualization purposes, the confidences are smoothed with a median filter of length 210 ms. Thresholding the smoothed confidence values with a threshold of 0.935 for the LRX confidence and a threshold of 0.9 for the DYN confidence leads to the binary activations depicted in Figure 4c and the F0-trajectories depicted in Figure 4a. Note that the thresholds are chosen exemplarily to show the differences between the microphones. In Part I, CREPE shows similar confidence values for both microphone signals when the tenor is singing. Part II shows significant differences between the two microphones. In this part, low confidence values are expected since the tenor is not active. Still, CREPE shows some confidence for both microphone signals due to cross-talk of the bass voice. However, one can find a suitable threshold for the LRX confidence to avoid an F0-output. Since the crosstalk is much stronger in the DYN signal, there exists no meaningful threshold that suppresses any F0-output in Part II of the DYN signal. In Part III, the F0-trajectory of the DYN microphone suffers from confusions with the bass voice even though the tenor is singing.

Dagstuhl ChoirSet Interfaces
The main goal of our work is to create a freely available and easy-to-access dataset in order to support MIR research on a cappella choral music. To this end, we provide several interfaces to interact with the dataset. As the most important step, we make the dataset publicly available in order to support scientific exchange and ensure reproducibility of scientific results. We decided to host DCS on Zenodo, 7 an Open Science platform, which supports sharing and distributing scientific data. As main features, the platform provides versioning and citeable Digital Object Identifiers (DOIs) for uploaded data.
However, Zenodo is a data repository and does not offer to play back the audio files in the browser. The interdisciplinary field of MIR benefits from interfaces that help to lower access barriers to datasets by providing direct, intuitive, and comprehensive access. This can be accomplished by means of interactive interfaces, e.g., with playback functionalities (Gasser et al., 2015;Jeong et al., 2017;Röwenstrunk et al., 2015). As one contribution, we created a publicly accessible web-based interface, 8 which hosts the multitrack audio data. The entry page of the interface is subdivided into a "Music Recordings" section providing links to the Locus Iste and Tebe Poem recordings as well as a "Systematic Exercises and Additional Recordings" section. Furthermore, the interface allows for searching and sorting of specific recordings. Each multitrack recording has an individual sub-page with an open source audio player (Werner et al., 2017) with score-following functionality (Zalkow et al., 2018) that allows for seamless switching between the different tracks.
Accompanying dataset-specific processing tools simplify the usage of datasets (Bittner et al., 2014(Bittner et al., , 2019. Therefore, we created a Python toolbox named DCStoolbox 9 that accompanies the release of the dataset. The toolbox provides basic functions to parse and load data from DCS, which are demonstrated in a Jupyter notebook. Additionally, the toolbox includes scripts to reproduce the computed F0-trajectories from Section 3.6 and an Anaconda 18 environment file that specifies all Python packages required to run the toolbox functions.

Applications to MIR Research
In this section, we demonstrate the potential of DCS for MIR research by means of two case studies. In the first case study discussed in Section 5.1, the goal is to evaluate and compare the intonation quality of quartet performances using a recently published intonation measure (Weiß et al., 2019). In the second case study, conducted in Section 5.2, we consider the task of multiple F0-estimation. More specifically, we apply a state-of-theart approach (Bittner et al., 2017) on different recordings and show the benefits of our multitrack recordings for multiple-F0-estimation in polyphonic vocal music.

Intonation Quality of Quartet Performances
A central challenge for a cappella singers is the adjustment of pitch in order to stay in tune relative to the fellow singers. Even if choirs achieve good local intonation, they may suffer from intonation drifts slowly evolving over time (Devaney, 2011). Algorithms that attempt to measure intonation quality have to account for such intonation drifts. A recently published approach measures the distance between the recording's local salient frequency content and a shifted 12-tone equaltempered (12-TET) grid (Weiß et al., 2019). Although choirs often aim for just intonation, the 12-TET scale has been used to approximate intonation in Western choral performances (Gnann et al., 2011). The intonation measure requires as input the F0s and harmonic partials (integer multiples of the F0) together with their respective amplitudes for the four singing voices. In a frame-wise fashion, a grid-shift parameter is computed that minimizes the distance between the F0s partials and the shifted 12-TET grid. As output, the approach returns a frame-wise intonation cost (IC) that reflects the remaining distance from the optimally shifted 12-TET grid. The IC is bounded in the interval [0, 1], where small values indicate good local intonation, and large values indicate local intonation deviations. In the following, we use this approach to compare the performances of Quartet A and B in our DCS. Weiß et al. (2019) show that multitrack recordings of the individual voices are beneficial for estimating the frequency and amplitude information required to compute the IC. For our case study, we make use of the recorded LRX and DYN signals as follows. We obtain the frequency information from the extracted pYIN F0-trajectories of the LRX signals (see Section 3.6). Using the time-aligned score representations from Section 3.5, we restrict the trajectories to regions where the respective voices are active. We obtain the amplitude information from a magnitude spectrogram representation of the DYN signals at the locations of the extracted LRX F0-trajectories and their harmonic partials. In our experiments, we consider 16 harmonic partials. Subsequently, we compute IC measure curves for all quartet recordings of Locus Iste in DCS. In order to compare the different takes, we map the curves on a common time axis in measures using the measure information encoded in the beat annotations from Section 3.4.
The averaged IC curves for six recordings of Quartet A and five recordings of Quartet B are depicted in Figure 6. To remove local outliers, we post-process the IC curves using a moving median filter of length 21 frames. Note that the IC is zero for silent regions and small for monophonic passages where only one singer is active (see measures 12, 20/21, and 43). Overall, the curves exhibit a similar progression. For both curves, we observe higher IC values in the passage from measures 13 to 20. This passage is challenging to sing due to the highly chromatic voice leading and the jumps in the bass part. Furthermore, the passage from measure 40 to 42 exhibits higher intonation costs for both quartets-a passage which is highly chromatic. The largest differences between the quartets can be found in the last part of the piece (measures 44 to 48). For this passage, Quartet B achieves a better intonation quality on average than Quartet A, especially in the intonation of the final chord of the piece.
This short case study indicates the potential of our recordings for studying intonation in polyphonic a cappella music. Furthermore, our data can form a starting point for future studies on singer interaction in amateur choirs.

Multiple-F0-Estimation in A Cappella Singing
Multiple-F0-estimation is defined as the task of estimating the F0s of several concurrent sounds in a polyphonic signal (Klapuri, 2008(Klapuri, , 2006. This task is particularily challenging for polyphonic vocal music Su et al., 2016). In a cappella choral singing, we find multiple singers with similar timbres singing in harmony, thus producing overlapping harmonics (Cuesta et al., 2019). Furthermore, it is very common that several singers sing the same part (unison), but produce slightly different frequencies. However, MIR research on multiple-F0-estimation in polyphonic vocal music has so far been focusing on SATB quartets and there exist no suitable methods for multiple-F0-estimation in larger ensembles with multiple singers per part. The Full Choir recordings in DCS constitue a starting point for further research in this direction.
In the following, we show the potential of DCS by applying a state-of-the-art multiple-F0-estimation algorithm on different scenarios offered by the DCS quartet recordings. The first scenario consists of applying the algorithm on a mix of all DYN signals. In the second and third scenario, the algorithm is applied on the STM signal (room microphone) with and without additional reverb. In particular, we consider the recordings of Locus Iste from Quartet A (Take 3).
In our case study, we use the DeepSalience method (Bittner et al., 2017), a deep convolutional neural network trained to produce a pitch salience representation (enhanced time-frequency representation) of the input signal, which contains values in the range [0, 1]. This salience representation is thresholded such that only timefrequency bins with a salience value above the chosen threshold remain. These remaining bins correspond to the multiple-F0-estimates. Although the model is not specifically trained for polyphonic vocal music, it was found to obtain the best performance for multiple-F0-estimation in vocal quartets (Cuesta et al., 2019). For the evaluation, we exploit the multitrack nature of DCS. In particular, we take the previously extracted pYIN F0-trajectories from the LRX signals as reference (see Section 3.6). Note that these trajectories are the output of an algorithm. Although our evaluation reveals they are very accurate (see Table 4), they still contain some errors. As evaluation metrics, we use the standard multiple-F0-estimation metrics Precision, Recall, and F-Score. For a detailed description of these metrics, we refer to Bittner (2018, Chapter II, Section 6.3). The evaluation metrics were computed using the mir_ eval library .
We experimented with several thresholds between 0.05 and 0.5, and found 0.1 to obtain the best results on the studied quartet recordings with respect to our evaluation metrics. However, instead of comparing absolute values (which is problematic for automatically extracted reference F0-trajectories), we want to focus on relative differences between the different scenarios. Figure 7a shows excerpts of the computed multiple-F0-estimates for the mix of DYN signals and the STM signal with reverb obtained by thresholding the salience representations with a threshold of 0.1. Figure 7b shows the evaluation results for all three scenarios. From the F-Score values, we observe that the algorithm performs best for the DYN signal mix of Quartet A. Furthermore, we observe that an increasing amount of reverb in the recordings goes along with a decreasing overall performance of the algorithm. This indicates that reverb further complicates the task of multiple-F0-estimation. The Precision and Recall measures give further insights into this observation. While Precision is lower in the scenario with reverb, Recall is not affected. In reverb conditions, sung notes become temporally smeared, leading to a temporal mismatch between the reference F0-trajectories from the LRX signals and the audio recording. For this reason, the number of false positives increases, causing Precision to decrease. This effect can be seen by comparing the red marked areas in Figure 7a. We leave a more detailed analysis of these effects to future studies. In summary, this brief case study indicates that the DCS is a versatile and challenging resource to develop and test algorithms for multiple-F0-estimation in polyphonic a cappella vocal music. Furthermore, the time-aligned score representations could serve as a reference for the evaluation of note-tracking algorithms. This requires accounting for intonation drifts of the choirs, which can, e.g., be determined from the F0-annotations.

Conclusions
In this paper, we presented Dagstuhl ChoirSet-a publicly accessible multitrack dataset of a cappella choral music for MIR research. This work is based on our recordings of an amateur vocal ensemble we gathered at an MIR seminar at Schloss Dagstuhl. As main feature of the dataset, the singers were recorded using different close-up microphones including dynamic, headset, and larynx microphones. As part of our work, we curated the recorded material and manually generated beat annotations as well as time-aligned sheet music representations. Furthermore, we automatically extracted F0-trajectories for all close-up microphone tracks. The dataset is released together with an interactive web-based interface and a Python toolbox to provide convenient access. In summary, the different musical and acoustical dimensions of DCS open up a variety of new and challenging scenarios for MIR research.