Structural Segmentation of Alap in Dhrupad Vocal Concerts

Dhrupad vocal concerts exhibit a temporal evolution through a sequence of homogeneous sections marked by shared rhythmic characteristics. In this work, we address the segmentation of a concert audio’s unmetered improvisatory section into musically meaningful segments at the highest time scale. Motivated by the distinct musical properties of the sections and their corresponding acoustic correlates, we compute a number of features for the segment boundary detection task. Both supervised and unsupervised approaches are tested using a dataset of commercial performance recordings that is manually annotated. The dataset is augmented suitably for training and testing of the models to obtain new insights about the relevance of the different rhythmic, melodic and timbral cues in the automatic boundary detection task. We also explore the use of a convolutional neural network trained on mel-scale magnitude spectrograms for the boundary detection task to observe that while the implicit musical cues are largely learned by the network, it is less robust to deviations from training data characteristics. We conclude that it can be rewarding to investigate knowledge driven features on new genres and tasks, both to achieve reasonable performance outcomes given limited datasets and for drawing a deeper understanding of genre characteristics from the acoustical analyses.


Introduction
Musical structure refers to the 'grouping', or the manner in which music is segmented, at a whole variety of levels from groups of a few notes up to the large-scale form of the work (Clarke, 1999). The relationships are created by the temporal order, repetition, homogeneity or contrast of musical aspects. Music structure analysis from audio is an important topic of research in Music Information Retrieval (MIR). However, much of this research has been restricted to Western or popular music cultures and does not generalize easily due to the high dependence of musical structure characteristics on the culture and genre.
In this work, we study the structural segmentation of concerts of the North Indian vocal genre, Dhrupad. In particular, we investigate unsupervised and supervised methods for the detection of structural boundaries in the elaborate improvised section of the concert known as the alap. Given that the genre has received little attention in MIR, even though considerable musicological scholarship is available, we test existing automatic methods for structural segmentation while exploring new approaches motivated by the characteristics of the music tradition. A top-down system design involving musicology and higher-level culture-specific perspectives can also provide new insights about performance practice beyond what is possible with purely data-driven methods (Serra, 2011).
Audio segmentation is crucial for MIR applications like fast navigation, finding repetitive structure in music or even for the task of music transcription (Klapuri et al., 2001). Metadata supplied with commercial CDs or performance audios on the internet provides information about the musicians and, possibly, about the number and durations of the constituent music pieces. However, information about the section boundaries within a piece is rarely specified. Segmentation can also facilitate other more complicated tasks like section labelling, audio thumbnailing that extracts short representative clips (Bartsch and Wakefield, 2005), and music summarization that stitches these thumbnails to aid rapid browsing (Peeters, 2003). An Indian art music concert typically involves a solo performer with accompanying instrumentalists. It is generally extempore in nature and lasts for a long duration, even up to a few hours. The overall concert structure stems from distinct sections of specific musical characteristics, organized hierarchically and lasting several minutes each, with the approaching section boundaries cued by the performer in various ways. The identification of section boundaries would therefore be a major step towards rich transcription for pedagogy and musicology research (Widdess, 1994).
The present work addresses the automatic detection of the major section boundaries in Dhrupad alap performance recordings using musicologically motivated representations of the acoustic cues to section change. Primarily intended to explore the potential of current MIR techniques to a musicologically interesting task in an under-researched genre, this exercise brings out the role of diverse musical cues in modeling section boundaries. We present a brief overview of the music tradition followed by a literature review of music segmentation methods. In Section 2, we present our dataset and annotation procedure. Acoustic features associated with known and observed musical characteristics are discussed in Section 3. Segmentation approaches, experiment conditions and models are explained in Section 4 and the results of boundary detection from the different approaches presented in Section 5. Finally, Section 6 summarizes the work and points to some future tasks.

Music Background
Music exhibits a rich structure at multiple time scales (McFee et al., 2017). One of the attributes distinguishing an ordinary sound sequence and a musical piece is this intricate structure. In a Hindustani music concert, the artist selects a particular melodic framework (raga) appropriate for the mood of the occasion and presents a pre-composed piece called bandish, including an elaborate improvisation adhering to the raga rules. Raga rules specify the number of pitch intervals in an octave, their hierarchy and precise intonation as well as the various ascending/descending phrasal contexts (Widdess, 2013). As such, raga delineation is a highly structured activity.
Dhrupad is the ancient North Indian classical style of singing, rendered with the tambura as the accompanying drone and the pakhawaj as the percussive accompaniment. The word Dhrupad refers to the compositional form as well as the genre of Dhrupad which includes singing of raga alap (an improvised performance based on the raga grammar) and pada (a composition interspersed with layakari, its melodic-rhythmic improvisation in performance) (Widdess, 1994). Thus a Dhrupad performance is an elaborate exploration of the raga that starts with an improvised, un-metered alap and is followed by the selected composition with lyrics sung to the accompaniment provided by the pakhawaj. A raga performance, that lasts up to an hour, is subdivided unequally into the improvised and composed sections with the former taking up even up to 80 percent of the performance time. The improvisation or alap, is further divided into alap-proper or vilambit alap (slow and devoid of pulsation), the jod or madhya alap (medium and steady pulsation) and the jhala or drut alap (faster pulsation). In some concerts, jalad jod or jalad jhala (faster versions of jod and jhala respectively) are present as well (Clayton, 2001). Hence, the number of sections in an alap is three or more. Figure 1 shows the back cover of an audio CD of a Dhrupad performance, where information about the total duration of the alap, and about each of the two composed sections, is provided without further break-down of each of the alap, jod and jhala subsections.
Throughout the alap, the singing comprises phrases made up of syllables such as na, re, te and di, accompanied only by the steady drone. The customary vocal range of Dhrupad is two to two and a half octaves, and the alap begins around the middle octave tonic. The vocalist explores the raga, note by note, sometimes reaching down to the tonic of the lower octave. After exploring the lowest octave, the rendition moves up into the middle octave and ultimately ascends up to the tonic of the highest octave (Wade, 2001). This gradual, progressive melodic ascent is the characteristic of each of the alap-proper and jod sections of the Dhrupad alap. A musical description of the different sections of a Dhrupad alap is presented in Table 1. The table also mentions the typical cues that the performer uses to indicate the major structural boundaries, the temporal scale of interest in this work. The hierarchical organization of the concert implies the existence of smaller segment boundaries that are similarly cued. For example, the musician utters a specific melodic motif noom at 'the transition of a melodic thought' (Wade, 2001). This is especially significant when the concert involves a pair of singers (an arrangement not uncommon in the Dhrupad style) who alternate with each other as though engaged in a conversation. Further, a specific rhythmic phrase called mohra comprising the syllable sequence ra-na-na-na-na ta-na-tom-ta-na-na signals a new section boundary when it is rendered at a new higher speed. The section boundaries of interest to us, as indicated in Table 1, are therefore cued by the noom and mohra, periods of silence, and abrupt changes in pulsation (via changing syllable rate). While the sections and their characteristics are distinctive of Dhrupad, it must be noted that the genre 'also comprehends incredible diversity as cultivated by individual musicians' (Wade, 2001).

Alap-proper
Rhythm free, slow and elaborate development of raga notes and phrases. A wide melodic range is spanned with focus gradually shifting from middle octave tonic to lower and then the higher octave. The melodic glide noom and mohra phrase serve as boundary cues.

Jod
Introduction of regular and slow pulsations via syllable rate. Melodic development and boundary cues similar to Alap-proper.

Jhala
Pulsation accelerates indicating climax. Syllable articulation more regular. The melodic range spanned is relatively narrow.

Figure 1: Back cover of a Dhrupad concert audio CD by Pt. Nirmalya Dey.

Structural Segmentation Literature Review
Structural boundaries in music are based on changes in the musical attributes of melody, timbre, rhythm or harmony. This motivates the choice of acoustic features used in a segmentation task. Perceptually, timbre represents the sound quality by which humans distinguish, for example, different instruments. Chroma features, or pitch class profiles, capture the melodic and harmonic content of a music piece. The relative strengths of each of the 12 notes of the equal-tempered scale characterize both the melody and the harmony in Western music. Timbre and harmony based segmentation methods are more common in Western music and have been used by many researchers to segment popular music into intro-chorus-verse-outro sections (Dannenberg and Goto, 2008; Paulus et al., 2010). Another widely used feature to characterize the global timbre of audio is the set of Mel Frequency Cepstral Coefficients (MFCCs) that parameterize relative sound levels in critical frequency bands (Logan, 2000; Foote and Cooper, 2003). Recently, Allegraud et al. (2019) reviewed the challenges inherent in the segmentation of large-scale structure in the classical sonata form and proposed methods exploiting music theory with computed melodic, harmonic and rhythmic features to characterize the evolving structure as a sequence of recurring states. Rhythmic cues have been exploited for segmentation of Chinese popular music by Jensen et al. (2005). Unlike in Western music, a beat is not prominently present in some music styles. Hence, the autocorrelation of an accent curve representing the note onsets has been used to get a more robust feature of rhythm over alternate methods based on inter onset interval histograms (IOIH) that rely on a precise location of onsets (Dixon, 2001). A low-dimensional rhythmic representation that captures tempo was used effectively for Western classical music by Grosche et al. (2010).
Jensen (2006) incorporated timbre, as well as chroma features along with rhythmic features in the segmentation of various styles of music with visualization using the timbregram and chromagram, confirming that multiple music dimensions are needed to account for the diversity inherent in music. Similarly, annotation principles, segmentation approaches and features were examined for structural segmentation of Chinese traditional Jingju music by Tian and Sandler (2016).
Different approaches are possible for segmenting music into sections: a homogeneity-based approach that locates the sections consistent in some musical aspect, a novelty-based approach that detects a sudden change in musical properties, or a repetition-based method that identifies the recurring patterns in a piece of music. A review by Paulus et al. (2010) of structure analysis from the perspective of segmentation and grouping of similar sections indicates the prominent place of unsupervised approaches. A particularly popular method, introduced by Foote (2000), locates the boundary between contrasting musical features via a self-distance matrix (SDM). Supervised approaches for pop/rock songs utilizing difference features for characterizing changes in musical aspects like timbre, harmony, melody and rhythm have been explored by Turnbull et al. (2007). In this work, they use a boosted decision stump (BDS) classifier that is trained to predict boundary/non-boundary frames. More recently, Ullrich et al. (2014) applied convolutional neural networks (CNN) trained similarly on mel-scaled magnitude spectrograms on the SALAMI structural annotation dataset that spans a large variety of genres (Smith et al., 2011).
Reviewing structural segmentation in the context of Indian art music, Ranjani and Sreenivas (2013) carried out a hierarchical classification of concert sections in the South Indian genre of Carnatic music based on rhythmicity and percussiveness, strongly signaled by the onsets and low frequency content of the accompanying percussion instrument. Thoshkahna et al. (2015) exploited the salience of the estimated tempo to distinguish sections with ambiguous tempo (alapana) from the later concert sections with clear rhythmic properties in Carnatic music concerts. Rhythmic analysis of Indian and Turkish music was explored by Srinivasamurthy et al. (2014), where beat tracking, meter estimation and downbeat detection were identified as musically relevant tasks that could benefit from computational methods. The rhythmic description of Indian music must consider multiple time spans at various levels of hierarchy. Gulati and Rao (2010) explored the use of different signal processing methods for rhythm pattern extraction and evaluated these for tempo detection in North Indian classical music.
Segmentation of Indian instrumental alaps into alap-jod-jhala sections based solely on the rhythmic attributes viz., tempo and its salience, was proposed by Verma et al. (2015), and homogeneity within alap sections was enhanced with the use of posterior probability features from unsupervised modeling. Recognizing that the instrumental music has structural similarities with Dhrupad vocal music, the rhythm features and unsupervised segmentation developed for the plucked strings were tested as such on a small vocal alap dataset. The observed low segmentation performance in this case was speculated to come from the inherently harder problem of syllable onset detection in vocals, and future work suggested "to explore new features" for Dhrupad vocal concert segmentation.
In the present work, we start with rhythmic cues, given the most obvious musical property that distinguishes alap sections, namely the tempo. Motivated by the observations of Verma et al. (2015), we explore new features based on other known and observed musical properties to improve the robustness of vocal alap segmentation over a larger and more structurally diverse dataset. In particular, the dataset now includes concerts with a varying number of sections necessitating new work to handle this unknown. The self-distance matrix provides an excellent framework for visualization, and subsequent computation of the section boundaries in a completely unsupervised manner. Given that the SDM with a suitable choice of feature vector has been an influential method of structural segmentation, we use it as a baseline system in this work. We further explore two supervised methods using the newly enhanced set of melodic-rhythmic features for the detection of structural boundaries, viz. a random forest classifier and a CNN classifier.

Data Annotation and Evaluation
A dataset for the structural analyses of Dhrupad vocal concerts was put together from available commercial recordings of live performances by leading musicians of the genre. Full length Dhrupad alaps were extracted from concert recordings of five leading artists viz. Gundecha Brothers, Uday Bhawalkar, Ritwik Sanyal, S Wasifuddin Dagar and Sulabha-Manoj Saraf. Nearly half the audios in the dataset are duet performances by two male musicians, the Gundecha Brothers, and one is a duet by a male and a female vocalist. In the duet performances, while the musicians sing alternately most of the time, some phrases are sung in unison. All the musicians are senior exponents of Dhrupad's prominent Dagar tradition, having performed since the mid 1980s. The performances show structural diversity, with the number of sections ranging from 3 to 6, with some containing additional faster versions of jod and jhala. The concerts are rendered in different ragas. The concert audios were annotated for the major structural boundaries. We finally have a dataset comprising 20 alap audios of a total duration of 762 minutes, with 53 section boundaries as presented in Table 8, with Section 7 providing a link to further information in the interest of reproducibility.

Data Characteristics and Annotation
The Dhrupad alap can be viewed as a succession of states, each manifested in an audio segment with a certain musical 'role' with a characteristic behaviour (Peeters and Deruty, 2009). The presence of regular pulsation and its speed are the most distinct properties of a section within a concert. The section boundary itself is cued in multiple ways with both static and dynamic behaviours as presented in Table 1. Towards the end of a section, the vocal melody has typically progressed to the highest notes in the range. The melodic ornament, noom, is executed next, ending at the middle tonic before a mohra is uttered signifying a change. Given the multiplicity of musical cues, manual labeling of a boundary is expected to have a strong element of subjectivity. We used a consensus based approach involving a discussion between one of the authors and a trained Dhrupad artist. Given that the pulsation speed associated with a section is the most immediately recognised property for a listener, the boundary between sections was marked consistently at the onset of the vocal phrase that introduced the new syllable rate, typically the mohra. This ensured the repeatability of the labeling. However, the acoustic features computed in this work correspond to the distinct melodic and rhythmic cues of Table 1, which are actually spread out in time to varying extents (between 3 s and 20 s) in the different concerts. Figure 2, computed with speech analysis software PRAAT (Boersma and Weenink, 2017), shows an excerpt with the glide noom and the mohra appearing at the jod-jhala transition. In this case, the mohra is the first vocal phrase at the higher speed of the jhala. It must be noted that the boundary cues, noom and mohra, can occur independently within concert sections as well, and when they occur at the boundary between sections, they can be separated by varying extents of silent pauses in the singing.
This temporal spread of the various cues indicates the ambiguity inherent in the musical boundary instant location.
The concert sections are of unequal duration, of the order of several minutes, and vary considerably across concerts. Some alaps contain a faster (jalad) section of jhala, while a few contain a jalad-jod section. Figure 3 displays the diversity of section durations and mean tempi across concert sections in the dataset. We note that, unlike the case of the Western music song, a section lasts for several minutes. The time scale of variations is thus relatively large and expected to influence the choice of feature analysis contexts and parameters.

Figure 2: [...] showing the melodic glide noom (with its first harmonic in the box).
In alaps containing more than three sections, the sections are labeled as either jhala or jod based on the regularity of syllable articulation and the melodic range spanned (Clayton, 2001). We see that the later sections have higher tempi while there is some overlap across the different concerts.

Train-Test Sets and Evaluation Criteria
The structural segmentation of vocal alap is implemented in this work via the automatic detection of the major section boundaries. We have 20 concert recordings with 53 ground truth boundaries marked in our dataset, described by Table 8. In the interest of testing over a large enough dataset while avoiding train-test leak, we evaluate our supervised segmentation systems using leave-one-concert-out 20-fold cross validation as depicted in Figure 4. In each fold, the 19 concerts forming the training set are time- and pitch-shifted in order to obtain an augmented train set. The audios are time-shifted by delays of 0.1, 0.2, 0.3 and 0.4 seconds and pitch-shifted by -2, -1, 1 and 2 semitones. Although the time shifting does not really alter the signal, it changes the frame-level acoustic feature values. The (held-out) test concert is subjected only to time-shifting in order to maintain the acoustics of the test data as such. With 5 versions of the test concert per fold (the original and its four time-shifted copies), we finally report performance that is averaged over the augmented test set of 100 concerts in each of the supervised and unsupervised system evaluations.
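The fold construction described above can be sketched as follows. This is only an illustrative outline under stated assumptions: the concert identifiers are placeholders, the function names are hypothetical, and each shift is assumed to be applied on its own to the original audio (the text does not say whether time and pitch shifts are also combined).

```python
# Augmentation settings quoted in the text.
TIME_SHIFTS_S = [0.1, 0.2, 0.3, 0.4]
PITCH_SHIFTS_SEMITONES = [-2, -1, 1, 2]


def augmented_versions(concert_id, allow_pitch_shift):
    """Return (concert_id, time_shift_s, pitch_shift_semitones) variants.

    Assumption: every shift is applied separately to the original audio.
    """
    versions = [(concert_id, 0.0, 0)]  # the unmodified recording
    versions += [(concert_id, dt, 0) for dt in TIME_SHIFTS_S]
    if allow_pitch_shift:
        versions += [(concert_id, 0.0, st) for st in PITCH_SHIFTS_SEMITONES]
    return versions


def leave_one_concert_out(concert_ids):
    """Yield (held_out, train_versions, test_versions) for every fold."""
    for held_out in concert_ids:
        train = [v for c in concert_ids if c != held_out
                 for v in augmented_versions(c, allow_pitch_shift=True)]
        # The held-out concert is only time-shifted, never pitch-shifted.
        test = augmented_versions(held_out, allow_pitch_shift=False)
        yield held_out, train, test


concerts = [f"concert_{i:02d}" for i in range(20)]  # placeholder identifiers
folds = list(leave_one_concert_out(concerts))
```

Keeping all versions of a concert on the same side of the split is what prevents augmented copies of the test concert from leaking into training.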
Viewed as a concert section boundary detection task, we examine uniformly spaced intervals of the audio signal (of duration 1 s, as explained in the next section) for the presence or absence of a boundary. A detected boundary is declared a hit if the prediction is within ±Tol seconds of a ground truth (GT) boundary; otherwise, it is a false positive. Based on the average inter-judge labeling differences noted by Verma et al. (2015), and also consistent with the observed temporal spread of the different musical cues discussed in the previous section, we report performance with an evaluation tolerance of 15 s using the usual measures of precision, recall and F-value.
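A minimal sketch of this tolerance-based scoring is given below. The function name is hypothetical, and the one-to-one matching (each ground truth boundary can be claimed by at most one prediction) is an assumption, since the text does not specify how multiple detections near the same boundary are counted.

```python
def evaluate_boundaries(predicted, ground_truth, tol=15.0):
    """Precision, recall and F-value for boundary times given in seconds.

    Assumption: greedy one-to-one matching, so a second prediction near an
    already matched ground truth boundary counts as a false positive.
    """
    unmatched = sorted(ground_truth)
    hits = 0
    for p in sorted(predicted):
        if not unmatched:
            break
        nearest = min(unmatched, key=lambda g: abs(g - p))
        if abs(nearest - p) <= tol:
            hits += 1
            unmatched.remove(nearest)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_value


# Two of three detections fall within 15 s of the two annotated boundaries.
precision, recall, f_value = evaluate_boundaries([100.0, 310.0, 500.0],
                                                 [105.0, 300.0])
```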

Acoustic Characteristics and Features
In Dhrupad alap, the voice is accompanied only by the drone which tends to have a nearly flat and static harmonic spectrum. Of the musical properties presented in Table 1, the perceptually most salient characteristic of a section is the rhythm in the form of the local syllable rate or tempo. Further, the melodic development across a section is gradual with a reset occurring at the boundary between sections. Motivated by these observations, we consider multiple acoustic features as described next and summarised by the feature extraction flow-chart of Figure 9.

Rhythm Features
The basic dimensions of Indian classical music are melody and rhythm but at the largest time scales relevant to concert structure, rhythm is the prominent distinguishing attribute (Clayton, 2001, p. 96). In a broader sense, rhythm refers to all the aspects of musical time patterns such as the way syllables of the lyrics are uttered, the way the strokes of a musical instrument are played or the inherent tempo of a melodic piece. A rhythm representation can be derived by observing the regularity of note event onsets over a suitably long duration.

Vocal Onset Detection
The rhythm in Dhrupad alaps arises from the rhythmic rendering of vocal syllables. The onset of a syllable is marked by a transient event, characterized by a sudden burst of energy or a change in the short-time spectrum of the audio signal. A computationally simple and effective method of note onset detection involves calculating the temporal derivative of the short-time energy (Bello et al., 2005). The syllables typically uttered (na, re, ti, de) are marked by a prominent energy rise in the frequency band of 600-2800 Hz at the consonant-vowel transition (Kumar et al., 2007). The sub-band energy at frame n is given by Equation 1, where k is the frequency bin index, |X[n, k]| is the spectral amplitude feature computed using the short-time DFT of the input audio signal and W[k] is the band limiting filter response with unity gain in the 600-2800 Hz band. The short-time spectrum is computed with a sliding Hann window of duration 30 ms and a hop of 10 ms corresponding to a frame rate of 100/s. An onset instant is then a peak in the derivative of SB_Ener[n]. A robust estimate of the derivative is obtained by incorporating some smoothing prior to differencing via a bi-phasic function serving as a filter. A discrete time filter, h[n], is obtained by sampling the impulse response of the biphasic filter recommended for vowel onset detection (given by Eq. 4 in Hermes, 1990) at the required 10 ms frame intervals. Figure 5a shows a plot of h[n] superposed on the underlying continuous-time biphasic function whose parameters are the lobe widths and locations of its two peaks. The same filter was used effectively in the context of sung and hummed notes by Kumar et al. (2007). An onset detection function (ODF) is obtained then by the convolution of the sub-band energy function with the filter impulse response.

Figure 4: One instance (fold) of the 20-fold cross-validation process adopted for train-test data splitting.
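The onset detection steps can be sketched roughly as below. Several details are assumptions rather than the authors' settings: the sub-band energy is taken as the band-limited sum of squared spectral magnitudes, and the biphasic filter is approximated by a difference of two Gaussians with hypothetical lobe widths and positions instead of the Hermes (1990) response.

```python
import numpy as np


def subband_energy(x, sr, win_s=0.030, hop_s=0.010, band=(600.0, 2800.0)):
    """Band-limited short-time energy, one value per 10 ms frame."""
    win, hop = int(round(win_s * sr)), int(round(hop_s * sr))
    window = np.hanning(win)
    freqs = np.fft.rfftfreq(win, d=1.0 / sr)
    # W[k]: unity gain inside the band, zero outside.
    W = ((freqs >= band[0]) & (freqs <= band[1])).astype(float)
    n_frames = 1 + (len(x) - win) // hop
    energy = np.empty(n_frames)
    for n in range(n_frames):
        frame = x[n * hop:n * hop + win] * window
        mag = np.abs(np.fft.rfft(frame))
        energy[n] = np.sum(W * mag ** 2)  # assumed form: squared magnitude
    return energy


def biphasic_kernel(half_len=15, sigma=3.0, offset=4.0):
    """Stand-in biphasic filter (difference of two Gaussians), in frames."""
    t = np.arange(-half_len, half_len + 1)
    ahead = np.exp(-0.5 * ((t + offset) / sigma) ** 2)   # weights later frames
    behind = np.exp(-0.5 * ((t - offset) / sigma) ** 2)  # weights earlier frames
    h = ahead - behind
    return h / np.sum(np.abs(h))


def onset_detection_function(x, sr):
    """Smoothed derivative of the sub-band energy; peaks suggest onsets."""
    return np.convolve(subband_energy(x, sr), biphasic_kernel(), mode="same")


# Synthetic check: a 1 kHz tone that starts abruptly at 0.5 s.
sr = 16000
t = np.arange(sr) / sr
x = np.where(t >= 0.5, np.sin(2 * np.pi * 1000 * t), 0.0)
odf = onset_detection_function(x, sr)
onset_time_s = np.argmax(odf) * 0.010
```

On real singing, onsets would be taken as the local maxima of `odf` above a threshold, which is the quantity tuned in the evaluation that follows.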
The quality of the onset detection function is expected to influence the reliability of the rhythm representation derived from it. We therefore tested the ODF independently on a selected diverse set of 130 labeled vocal syllable onsets across 6 concert segments spanning jod and jhala sections. We obtained a recall of 0.7 and a precision of 0.8 at the peak-picking threshold corresponding to the best F-score of 0.75. The performance was observed to be superior in jhala due to the more regularly articulated syllables relative to the jod utterances. However, the onset detection performance overall on vocal syllables is significantly lower than that achieved on sitar and sarod plucks in instrumental concerts (Vinutha et al., 2016). This attests to the challenge in vocal music posed by the greater diversity of phonetic realisations and singing styles.

Tempo Estimation
The local tempo can be estimated by measuring the periodicity of the onset detection function. Given the frequent occurrences of brief, intermittent silences in the singing, we choose a window duration of 20 s over which to compute the short-time autocorrelation function (ACF). The ACF of the onset detection function (sampled at 10 ms intervals) is computed for up to 300 lags (range of 0-3 s, which spans several pulse periods at the lowest expected tempo of 100 BPM). A powerful visualization of the periodicity captured by the short-time ACF is seen in the image of ACF strength versus time and lag in Figure 6, known as a rhythmogram (Jensen et al., 2005). We observe the absence of periodic structure in the alap-proper section, while the jod and jhala sections are characterized by a strong periodicity, indicated by the horizontal striations equispaced in lag. The decreasing separation between striations indicates an increasing rate of onsets, i.e., increasing tempo. The boundaries between the sections are clearly visible in the rhythmogram, suggesting that the ACF could serve as a feature vector for SDM-based segmentation.
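As a rough illustration, the short-time ACF (rhythmogram) computation might look like the sketch below. The 1 s hop between analysis windows, the 400 BPM upper tempo limit and the function names are assumptions, and the tempo readout simply picks the strongest ACF peak in the allowed lag range as a simplification.

```python
import numpy as np


def rhythmogram(odf, frame_rate=100, win_s=20.0, hop_s=1.0, max_lag=300):
    """Normalized short-time ACF of the ODF: one row of lags per window."""
    win, hop = int(win_s * frame_rate), int(hop_s * frame_rate)
    rows = []
    for start in range(0, len(odf) - win + 1, hop):
        seg = odf[start:start + win]
        seg = seg - seg.mean()
        # Keep lags 0..max_lag; index win - 1 of the full output is lag 0.
        acf = np.correlate(seg, seg, mode="full")[win - 1:win + max_lag]
        rows.append(acf / acf[0] if acf[0] > 0 else np.zeros(max_lag + 1))
    return np.array(rows)


def tempo_from_acf(acf_rows, frame_rate=100, min_bpm=100.0, max_bpm=400.0):
    """Tempo (BPM) from the lag of the largest ACF value in the tempo range."""
    lo = int(np.ceil(60.0 * frame_rate / max_bpm))   # shortest period, frames
    hi = int(np.floor(60.0 * frame_rate / min_bpm))  # longest period, frames
    best_lag = lo + np.argmax(acf_rows[:, lo:hi + 1], axis=1)
    return 60.0 * frame_rate / best_lag


# Synthetic ODF: one pulse every 0.5 s (120 BPM) for 60 s at 100 frames/s.
odf = np.zeros(6000)
odf[::50] = 1.0
R = rhythmogram(odf)
tempo_bpm = tempo_from_acf(R)
```

Plotting `R.T` as an image, with time on the horizontal axis and lag on the vertical axis, gives the kind of striated picture described for the jod and jhala sections.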
The ACF is a high-dimensional vector that embeds the rate of pulsation given the absence of metrical hierarchy in the Dhrupad alap. It can potentially be replaced by a single tempo value. A reliable method of tempo detection combines the ACF and DFT in a product that yields tempo estimates relatively free from octave error (Peeters, 2007). Next, the normalized ACF peak value corresponding to the detected tempo is used as a measure of salience or pulse clarity, and detected tempo values with a low salience (<0.1) are clamped to zero. We thus obtain the two-dimensional vector (tempo, salience) as a compact alternative to the high-dimensional ACF vector. Figure 7a and 7b show the time-varying tempo and salience respectively. We observe that within the jod and jhala sections the tempo gradually increases, with a jump at the boundaries, while in the rhythm-free alap section the estimated salience is uniformly low and the detected tempo is random.

Posterior Features
Verma et al. (2015) showed that transforming the rhythm feature vector of (tempo, salience) to a vector of class-conditional probabilities (or posteriors), where the classes comprise the distinct sections, improves homogeneity of the resultant features within a section. Each feature vector V_i of a frame i of a concert is transformed to a vector q_i whose length matches the estimated number of sections, K. We derive the posterior features by the unsupervised clustering of the rhythm features using a GMM with K Gaussians (C_1, C_2, …, C_K) representing the K sections. Thus, each q_i comprises:

\[ q_i = \left[ P(C_1 \mid V_i),\; P(C_2 \mid V_i),\; \ldots,\; P(C_K \mid V_i) \right] \tag{3} \]

In Equation 3, the k-th dimension of q_i represents the posterior probability P, given the frame vector V_i, of the k-th Gaussian component. The GMM is trained with maximum likelihood across all the frames in a given concert (i.e. in an unsupervised manner). We expect noisy feature values, which in turn affect the homogeneity of a section, to be ideally mapped to low probability values in the posterior vector. Figure 7c plots the posterior probabilities with time. We see that the unsupervised clustering has indeed resulted in peaky posteriors with only one or the other of the 4 probabilities in the posterior vector (presumably the one corresponding to the Gaussian representing that section) dominating in a given ground truth section, with relatively sharp transitions at the boundaries. Verma et al. (2015) set the number of sections to a fixed value (K = 3) since this was the ground truth across all 10 concerts in their dataset. Given the greater diversity in the current dataset (where the number of sections ranges between 3 and 6), the question arises of estimating the number of sections, K. We apply the Bayesian Information Criterion (BIC), a likelihood criterion penalised by the model complexity, to estimate K in the expected range (Chen and Gopalakrishnan, 1998). We choose the value of K that minimises the BIC criterion for the given concert. This gives us finally a variable dimension vector of posterior probabilities. It was observed that the number of sections is estimated correctly in 13 of the 20 concerts and over-estimated by 1 or 2 in the remaining, typically longer duration, concerts.
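The posterior feature computation with BIC-based selection of K can be sketched as follows. This is a simplified illustration, not the authors' implementation: the diagonal covariances, the deterministic initialisation, the variance floor and the function names are all assumptions, and the toy (tempo, salience)-like data are synthetic.

```python
import numpy as np


def _e_step(X, w, mu, var):
    """Posteriors P(C_k | V_i) and total log-likelihood for a diagonal GMM."""
    diff = X[:, None, :] - mu[None, :, :]
    log_pdf = -0.5 * (np.sum(diff ** 2 / var, axis=2)
                      + np.sum(np.log(2 * np.pi * var), axis=1))
    log_joint = log_pdf + np.log(w)
    log_norm = np.logaddexp.reduce(log_joint, axis=1)
    return np.exp(log_joint - log_norm[:, None]), log_norm.sum()


def fit_diag_gmm(X, K, n_iter=100, var_floor=1e-3):
    """Maximum-likelihood EM; returns (posteriors, log_likelihood)."""
    N, _ = X.shape
    # Assumed initialisation: split frames sorted by the first feature.
    chunks = np.array_split(np.argsort(X[:, 0]), K)
    mu = np.array([X[c].mean(axis=0) for c in chunks])
    var = np.tile(X.var(axis=0) + var_floor, (K, 1))
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        post, _ = _e_step(X, w, mu, var)
        Nk = post.sum(axis=0) + 1e-12
        w = Nk / N
        mu = (post.T @ X) / Nk[:, None]
        diff = X[:, None, :] - mu[None, :, :]
        var = np.einsum("nk,nkd->kd", post, diff ** 2) / Nk[:, None] + var_floor
    return _e_step(X, w, mu, var)


def posterior_features(X, k_range=range(3, 7)):
    """Pick K by minimum BIC and return (K, posterior vectors q_i)."""
    N, D = X.shape
    best = None
    for K in k_range:
        post, log_lik = fit_diag_gmm(X, K)
        n_params = 2 * K * D + (K - 1)  # means, variances, mixture weights
        bic = -2.0 * log_lik + n_params * np.log(N)
        if best is None or bic < best[0]:
            best = (bic, K, post)
    return best[1], best[2]


# Toy two-dimensional frames drawn around three well-separated centres.
rng = np.random.default_rng(0)
V = np.vstack([rng.normal(loc=m, scale=0.3, size=(200, 2))
               for m in ([0.0, 0.0], [3.0, 1.0], [6.0, 2.0])])
K, q = posterior_features(V)
```

Each row of `q` sums to one, so a frame that sits clearly inside one cluster yields the peaky posterior behaviour described for Figure 7c.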

Melody and Timbre Features
We note from Section 2.1 that a concert section is marked by the progression of the melody from the middle octave, down, and then up to the higher reaches. We see this in Figure 2 where the pitch (as indicated by the harmonic spacing) is consistently higher before the boundary relative to that just after. The transition between sections is therefore marked by a prominent reset in the singing pitch. Pitch and chroma features essentially manifest this change but are found to be affected similarly by the melodic development within the section apart from the challenges presented by the nearly 3-octave range. A related but more robust attribute is found in the changing loudness and brightness of the voice with pitch, arising from the increased sub-glottal pressure or vocal effort required to produce higher pitches (Sundberg, 1990). We therefore examine the use of timbre features in section boundary detection. The short-time log magnitude spectrum computed on the auditory mel scale (log mel spectrogram), using 40 filters in the 80 Hz-8 kHz band, is shown in Figure 8 across duration for the UB_AhirBhrv alap. We can see the shift in energy away from the lower frequency bands as the melody attains its height near the boundary. This is accompanied by an increase in frequency spreading. This changing spectral shape, due to changing vocal intensity and brightness, can be compactly represented by the short-time energy and the spectral centroid respectively. We consider these in our feature design, along with mel-frequency cepstral coefficients (MFCC), given their ubiquity in music classification tasks involving timbre (Logan, 2000). We include the 13-dimensional MFCC vector (coefficients C-0 to C-12) in our set of features for evaluation. The above features are computed at the 10 ms frame level, over 30 ms long sliding windows, and then subsampled to a 1 s frame level after averaging over suitably long sliding windows.
The length of the averaging window controls the smoothing and a longer window helps remove the fine fluctuations irrelevant at the larger timescales that we are interested in. We experiment with two relatively extreme values for the averaging window length -a short 3 s window relating to the duration of the noom glide, and a much longer one, 20 s, relating to longer-term trends in melodic pitch. With the intention of obtaining the section boundaries as the instances of change, we further compute derivatives of the short-time energy and spectral centroid features by convolving each with the discrete version of a biphasic filter given by Figure 5b. The peak width and location parameters were experimentally tuned to maximize the strength of peaks in the immediate vicinity of labeled boundaries. The outputs of the filter are referred to as the short-time energy (STE) difference and short-time centroid (STC) difference. Figure 7d shows the frame-level short-time energy difference with the anticipated sharp rise at the boundaries but also several peaks within each section. Figure 7e shows the spectral centroid difference, where again, peaks can be seen occurring at the marked ground truth boundaries. Figure 7f shows the second MFCC coefficient (C-1), associated with spectral envelope tilt, rising sharply at the start of every section and gradually dropping as the section progresses. We observe that both the spectral shape indicators are characterised by clearer boundary effects compared to the short-time energy.
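The averaging and subsampling step can be illustrated with a minimal Python sketch. It assumes a one-dimensional feature (such as the short-time energy) sampled every 10 ms, and it assumes that the averaging window is centred on each output frame and truncated at the ends of the recording; the text does not specify either detail. The function name and its parameters are hypothetical.

```python
def average_and_subsample(values, frame_hop_s=0.01, window_s=3.0, out_hop_s=1.0):
    """Average a frame-level feature over centred windows, one output per out_hop_s."""
    win = int(round(window_s / frame_hop_s))   # input frames per averaging window
    hop = int(round(out_hop_s / frame_hop_s))  # input frames per output step
    half = win // 2
    out = []
    for centre in range(0, len(values), hop):
        # Assumption: the window is centred and clipped at the signal edges.
        lo = max(0, centre - half)
        hi = min(len(values), centre + half + 1)
        segment = values[lo:hi]
        out.append(sum(segment) / len(segment))
    return out


# Ten seconds of a linearly increasing feature at a 10 ms hop.
ramp = [float(i) for i in range(1000)]
smoothed = average_and_subsample(ramp, window_s=3.0)
```

With window_s set to 20.0 instead, the same call gives the heavier smoothing used for longer-term trends.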

Structural Segmentation Methods
In our context, structural segmentation involves the detection of change between contrasting musical parts dictated by one or more musical attributes. This can be viewed as a boundary detection task where each frame, occurring at the rate of 1 Hz, is to be classified as a boundary or non-boundary frame based on whether a transition between musical sections occurs over the frame duration. The features, computed as presented in Figure 9 and summarised in Table 2, are individually z-score normalized across each concert to obtain a mean of 0 and a standard deviation of 1 to derive the classifier inputs. Boundary detection can be achieved by an unsupervised framework involving the SDM and kernel correlation or by the supervised classification of the 1 s frames as boundary/non-boundary events. We also investigate feature learning, from the (relatively low-level) log mel-spectrogram representation, via a CNN classifier. In all cases, the evaluation uses the 100 time-shifted audios generated from the original 20 concerts to obtain reliable measures of boundary detection performance.
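The per-concert normalization can be sketched as follows. This is an illustrative example only: it assumes each feature is held as a plain list of 1 s frame values for one concert and that the population standard deviation is used; the helper name is hypothetical.

```python
import statistics


def zscore(feature):
    """Normalize one feature track of one concert to zero mean and unit deviation."""
    mean = statistics.fmean(feature)
    std = statistics.pstdev(feature)  # assumption: population standard deviation
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in feature]
    return [(x - mean) / std for x in feature]


normalized = zscore([1.0, 2.0, 3.0, 4.0, 5.0])
```

Normalizing within each concert, rather than across the dataset, removes recording-level offsets such as overall loudness before the features are compared across concerts.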

Unsupervised Segmentation
Given a feature vector stream, the SDM can be computed using a chosen distance measure, in our case the L2 distance (Paulus et al., 2010). Thus, a homogeneous segment of length M frames would appear as an M × M block of low distance values. Next, points of high contrast in the similarity matrix are captured by convolution along the diagonal with a checker-board kernel of dimension matched to the time-scale of interest (Foote, 2000). Given that the minimum section duration is about 100 s in the dataset, we examine kernels of size 50 × 50 and 100 × 100. The one-dimensional plot resulting from the convolution is called a novelty function, whose peaks indicate the boundary time instants in the feature vector stream. Figures 10 and 11 (a, b & c) present the SDM and novelty function respectively for the ACF, rhythmic features and the posterior features for the UB_AhirBhrv alap with the 100 × 100 kernel. The novelty function derived from the ACF is observed to be noisy. The rhythmic features, tempo and salience, improve upon this to an extent, while the posterior features visibly improve the homogeneity of the sections and consequently the accuracy of detected peaks in the novelty function, confirming the observations made by Verma et al. (2015) for instrumental concerts.
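The SDM and novelty computation described above can be sketched in a few lines of Python. This is a simplified illustration rather than the implementation used in this work: the kernel is left untapered (Foote, 2000, applies a Gaussian taper), frames closer than half a kernel to either end are given zero novelty, and the function names are hypothetical. Because the matrix holds distances, the cross-segment quadrants of the kernel are weighted +1 and the within-segment quadrants -1.

```python
import math


def self_distance_matrix(frames):
    """Pairwise L2 distances between feature vectors (one vector per 1 s frame)."""
    n = len(frames)
    return [[math.dist(frames[i], frames[j]) for j in range(n)] for i in range(n)]


def checkerboard_novelty(sdm, kernel_size):
    """Correlate a checker-board kernel with the SDM along its main diagonal."""
    n = len(sdm)
    half = kernel_size // 2
    novelty = [0.0] * n
    for t in range(half, n - half):
        score = 0.0
        for a in range(-half, half):
            for b in range(-half, half):
                # Opposite sides of t: cross-segment distance, counted positively.
                # Same side of t: within-segment distance, counted negatively.
                sign = 1.0 if (a < 0) != (b < 0) else -1.0
                score += sign * sdm[t + a][t + b]
        novelty[t] = score
    return novelty


# Toy stream: 20 frames of one homogeneous section followed by 20 of another.
frames = [[0.0]] * 20 + [[1.0]] * 20
novelty = checkerboard_novelty(self_distance_matrix(frames), kernel_size=10)
```

On this toy stream the novelty function peaks at frame 20, the first frame of the second section.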
It was observed that, rather than combining both rhythm and timbre features into a single vector for the SDM, fusing the information in the distinct feature streams at the peak picking stage provided more flexibility in terms of tuning the performance of the system. The STE-difference and STC-difference already represent feature derivatives and are treated directly as novelty functions for peak picking. The SDM for the 13-dimensional MFCC vector and the corresponding novelty function obtained by convolution with a chosen kernel size (100 × 100) are shown in Figures 10(d) and 11(d) respectively. We observe clear peaks at the labeled boundary instants, but also a few spurious peaks within sections arising from local timbre variations.
Next, the highest N (varied from 1 to 18, treated as a tunable parameter) peaks are picked in each feature's novelty function stream as boundary predictions, while ensuring that no two selected peaks are within 30 s of each other. For the information fusion, MFCC is taken as the reference as it is found to perform best among individual feature categories. Boundary candidates derived from novelty functions of the rhythm feature vector, MFCC vector and each of the two 1D timbre features are fused using a majority rule (i.e. two or more features out of three are checked for coincidence with the MFCC reference).
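The peak picking and fusion steps can be sketched as below. The sketch makes several assumptions that are not given in the text: peaks are simple local maxima, the 30 s constraint is enforced greedily from the strongest peak downwards, and two boundary candidates are treated as coincident when they lie within a hypothetical tolerance (10 frames here). Function and parameter names are illustrative only.

```python
def pick_peaks(novelty, n_peaks, min_gap=30):
    """Return up to n_peaks local maxima, at least min_gap frames apart."""
    candidates = [
        t for t in range(1, len(novelty) - 1)
        if novelty[t] > novelty[t - 1] and novelty[t] >= novelty[t + 1]
    ]
    candidates.sort(key=lambda t: novelty[t], reverse=True)  # strongest first
    picked = []
    for t in candidates:
        if all(abs(t - p) >= min_gap for p in picked):
            picked.append(t)
        if len(picked) == n_peaks:
            break
    return sorted(picked)


def fuse_with_reference(reference, others, tolerance=10, min_votes=2):
    """Keep a reference (MFCC) boundary if enough other streams agree with it."""
    fused = []
    for r in reference:
        # One vote per stream that has a candidate within the tolerance.
        votes = sum(any(abs(r - p) <= tolerance for p in stream) for stream in others)
        if votes >= min_votes:
            fused.append(r)
    return fused


toy_novelty = [0.0] * 100
toy_novelty[10], toy_novelty[20], toy_novelty[60] = 5.0, 4.0, 3.0
peaks = pick_peaks(toy_novelty, n_peaks=2)

mfcc_peaks = [100, 300, 500]
other_streams = [[98, 305], [102, 510], [290, 700]]  # rhythm, STE, STC (toy values)
fused = fuse_with_reference(mfcc_peaks, other_streams)
```

In the toy example the peak at frame 20 is discarded because it lies within 30 frames of the stronger peak at frame 10, and the MFCC candidate at 500 is dropped because only one of the three other streams supports it.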

Supervised Segmentation Methods
As mentioned in Section 2.2, classifiers are trained on the augmented dataset that includes the pitch and time shifted versions of all the audios in each fold. To account for ambiguity in manual boundary marking, arising from both the variety and the temporal spreading of the section change cues, targets are smeared by labelling all the frames in a ±15 second window about the manually labeled boundary as "boundary" frames (Ullrich et al., 2014). Next, in order to balance the non-boundary and boundary frame examples in training, only the boundary-labeled frames of the newly generated audios are retained, along with all of the frames of the original dataset. Based on the expectation that frame-level boundary detection would benefit from context, features from adjacent frames in a ±C s neighbourhood are appended to the current frame features. The corresponding target is a label indicating the presence or absence of a manually labeled boundary at the center frame. The classifier output is the estimated probability of a boundary in the frame. During evaluation, the frame-wise boundary predictions are post-processed by replacing predictions within a 30 s window with a single one of the highest strength (probability), given that the distinct acoustic cues to a section boundary can be spread over several seconds. The obtained frame-wise values are compared with a threshold to obtain the detected boundaries. The threshold is varied to obtain a Receiver Operating Characteristic (ROC) curve and to choose the point of best performance in terms of F-score with reference to the manually labeled boundaries.
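Target smearing and context stacking can be illustrated with the following sketch. It assumes 1 s frames indexed by integers, boundaries given as frame indices, and repetition of the first and last frames where the context window runs past the ends of the recording; the edge handling in particular is an assumption, and the function names are hypothetical.

```python
def smear_targets(n_frames, boundaries, half_width=15):
    """Label every frame within +/-half_width of a labeled boundary as 1."""
    targets = [0] * n_frames
    for b in boundaries:
        for t in range(max(0, b - half_width), min(n_frames, b + half_width + 1)):
            targets[t] = 1
    return targets


def add_context(features, context):
    """Append the feature vectors of frames within +/-context to each frame."""
    n = len(features)
    stacked = []
    for t in range(n):
        row = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), n - 1)  # assumption: repeat edge frames
            row.extend(features[idx])
        stacked.append(row)
    return stacked


targets = smear_targets(100, [50])
stacked = add_context([[0], [1], [2], [3]], context=1)
```

A single labeled boundary thus yields 31 positive frames, and a d-dimensional feature vector with a context of ±C frames becomes a vector of length d(2C + 1).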

Random Forest Classifier
A random forest classifier consists of an ensemble of decision trees and outputs the class with the majority vote (weighted by the corresponding probability values) as the model's prediction. The classifier is trained on input vectors of features (posterior rhythm and timbre) including those of the current and context frames. A target of 1 or 0 is assigned to each input training vector indicating whether the current frame is a manually labeled boundary or not. The posterior rhythm feature is a vector whose length is ideally equal to the number of sections in the alap, meaning that the length is not the same for every concert. Moreover, the values of this feature at every frame indicate the probability of the frame belonging to a particular section, and hence, for all the frames within a section, only one of the posteriors dominates, whereas very near a boundary, the values are observed to be more distributed in the [0,1] range due to the non-homogeneity of frames close to the boundary. Therefore, only the maximum value in the posterior rhythm vector is used as the corresponding feature.
While there are several model hyperparameters in a random forest classifier that can be tuned to optimize the performance, we focus mainly on optimizing the number of decision trees, and leave the others at their recommended values. In addition, we experiment with some hyperparameters related to feature extraction, such as averaging window size and context duration. The number of decision trees is varied between 10 and 100 in steps of 10, while the context duration (C) is swept from ±10 to ±50 seconds in steps of 10 s. The window over which the timbre features are averaged is set to each of the two values, 3 s and 20 s, while the window for the rhythm feature computation from the onset detection function is always set to 20 s, as explained earlier.

Convolutional Neural Network
We investigate the direct learning of features from the data by the use of a convolutional neural network (CNN)-based classifier. A relatively generic form of CNN was successfully employed for music structure analysis on the large SALAMI dataset (Ullrich et al., 2014). In our work, the features, or representation, are learned from the input mel-spectrogram (as described in Section 3.2) by training with frame-level targets indicating the manually labeled boundary and non-boundary frames. As we wish to obtain boundary predictions at a 1 s frame resolution, the log mel-spectrogram is first sub-sampled by averaging within overlapping windows of a suitable size, with a hop of 1 s (in line with the use of averaging windows in the feature extraction step described earlier). The mel-spectrogram is next split into smaller overlapping chunks of size 40 × N_C, by taking ±N_C/2 adjacent frames as context (corresponding to ±C s) for each frame input to the network. The input to the network is normalised so that all values in a single input chunk lie in the range -1 to +1. The target output is 1 or 0, indicating a manually labeled boundary or none in the center frame.
The CNN model architecture used in this work is based on Ullrich et al. (2014), with two conv. layers, the first with 16 filters of 6 × 6 kernels (6 time frames, 6 mel-bands), and the second with 32 filters of 3 × 3 kernels. Each conv. layer is followed by a max-pool layer of dimension 3 × 3. Next, we use a fully-connected (FC) layer with 128 hidden units and finally an FC layer (output) with 2 units. Each of the conv. layers applies a ReLU activation, the first FC layer applies a sigmoid, and the output layer applies a softmax activation. We also experimented with non-square kernels and with other combinations of activation functions: using tanh instead of ReLU for the conv. layers, and using ReLU for all the layers. These initial experiments are used to fix the model hyperparameters, before trying to optimize the input feature hyperparameters like the averaging and context window durations. We experiment with two values for the averaging window (the duration over which the mel-spectrogram is locally averaged with a sliding window), 3 s and 20 s, and in each case vary the context duration (C) from ±10 to ±50 s in steps of 10 s. The model is trained using binary cross-entropy loss on mini-batches of 64 samples, over 20 epochs, using an Adam optimizer with an initial learning rate of 0.001. The model is trained and evaluated using a leave-one-concert-out method as depicted in Figure 4. The epoch resulting in the least validation loss computed on the held-out concert is used to obtain the boundary predictions on the corresponding concert.
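The chunking and per-chunk normalisation can be sketched as follows. The text states only that the values of each chunk lie between -1 and +1; the min-max scaling used here is one way of achieving that and is an assumption, as are the repetition of edge frames and the inclusion of the centre frame (which gives 2 × half_context + 1 columns). The mel-spectrogram is represented as a list of bands, each a list of 1 s frame values, and the function name is hypothetical.

```python
def extract_chunk(mel, centre, half_context):
    """Cut a (bands x frames) chunk around one frame and scale it to [-1, 1]."""
    n_frames = len(mel[0])
    # Assumption: frames beyond the recording edges repeat the edge frame.
    cols = [
        min(max(centre + k, 0), n_frames - 1)
        for k in range(-half_context, half_context + 1)
    ]
    chunk = [[band[c] for c in cols] for band in mel]
    lo = min(min(row) for row in chunk)
    hi = max(max(row) for row in chunk)
    if hi == lo:
        return [[0.0 for _ in row] for row in chunk]
    # Assumption: min-max scaling computed over the whole chunk.
    return [[2.0 * (v - lo) / (hi - lo) - 1.0 for v in row] for row in chunk]


# Toy spectrogram with 4 bands and 10 frames.
toy_mel = [[float(b + t) for t in range(10)] for b in range(4)]
chunk = extract_chunk(toy_mel, centre=5, half_context=2)
```

Scaling each chunk independently makes the network input insensitive to the absolute level of the recording, at the cost of discarding level differences between chunks.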
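To give a sense of the size of the network, the feature-map dimensions after the two conv. and pool stages can be worked out with a short sketch. The padding and stride settings are not stated in the text, so the sketch assumes unpadded convolutions with stride 1 and non-overlapping pooling with the remainder discarded; under other settings the numbers would differ. The helper names are hypothetical.

```python
def conv_out(size, kernel):
    """Output length of an unpadded, stride-1 convolution (assumed)."""
    return size - kernel + 1


def pool_out(size, pool):
    """Output length of non-overlapping pooling, remainder discarded (assumed)."""
    return size // pool


def cnn_feature_shape(n_mels=40, n_frames=41):
    """Feature-map shape (bands, frames, channels) entering the first FC layer."""
    h, w = n_mels, n_frames
    h, w = conv_out(h, 6), conv_out(w, 6)  # first conv. layer, 6 x 6 kernels
    h, w = pool_out(h, 3), pool_out(w, 3)  # 3 x 3 max-pool
    h, w = conv_out(h, 3), conv_out(w, 3)  # second conv. layer, 3 x 3 kernels
    h, w = pool_out(h, 3), pool_out(w, 3)  # 3 x 3 max-pool
    return h, w, 32                        # 32 filters in the second conv. layer


shape_20s = cnn_feature_shape(n_frames=41)   # +/-20 s context plus centre frame
shape_50s = cnn_feature_shape(n_frames=101)  # +/-50 s context plus centre frame
```

Under these assumptions a ±20 s context gives a 3 × 3 × 32 map (288 values) and a ±50 s context gives a 3 × 10 × 32 map (960 values) at the input of the 128-unit FC layer.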

Results and Discussion
We present the segment boundary detection performances of our baseline unsupervised system and the two supervised (RF and CNN) methods as averages obtained across our test set of 100 concerts including the time-shifted versions of the original audios. We report results for the distinct, as well as combined, feature subsets specified in Table 2. Table 3 presents the unsupervised boundary detection performance obtained using the different feature sets. Testing across the system hyperparameters for the rhythm and timbre feature categories, a shorter averaging window of 3 s for the STE and STC features, and a wider 100 × 100 checker-board kernel for the SDM (used for the rhythm and MFCC feature vectors) are found to result in higher F-scores relative to the other considered alternatives. The optimal number of candidate peaks for each novelty function stream is found to be N = 7. It can be seen that while the timbre features bring only a small improvement over the sole use of MFCC, the fusion of timbre and rhythm features is clearly superior to segmentation based on timbre alone. Table 4 shows the best boundary detection performance obtained with the RF classifier using each feature subset, and the corresponding values of the hyperparameters. An averaging window duration of 20 s for the timbre features was found to yield the best performance and hence all the results are reported with this setting. It can be seen that the rhythm features alone do not perform well, while the timbre features do significantly better, and a combination of the rhythm and timbre features results in a small further boost to the performance.

Random Forest Classifier
It is interesting to note that MFCCs alone perform comparatively well, but with a much higher context duration than when using all the timbre features. However, the improved score using all the timbre features comes at a cost of a higher number of trees. Adding rhythm features increases the F-score further only slightly, and at the cost of even more trees in the classifier. Further, the best F-score for each feature subset is not obtained necessarily at the highest values of the context duration and number of trees, suggesting that after a point the model starts to overfit. Also interesting, although not explainable, is that while in the unsupervised case a 3 s timbre averaging window worked best, the RF classifier did best with a 20 s window.
We also train the model without any of the data augmentation discussed in Section 2.2 to assess its contribution to the overall performance. These results are reported only for the all features condition, appearing within parentheses in Table 4. It is seen that augmentation clearly helps improve the recall in boundary detection.

CNN
In the CNN model-related experiments we started from the architecture described in Section 4.2.2, and experimented with rectangular kernels for the conv. as well as the pool layers. The rectangular kernels in the pool layer were made longer only in the frequency dimension in order to preserve temporal resolution. During these preliminary experiments, the averaging window and context duration were set to 3 s and ±20 s, respectively. The final architecture that yielded the best results had kernels of size 3 × 6 in the conv. layers (3 along frequency and 6 along time), and of size 3 × 3 in the pool layers. Table 5 shows the results obtained using this architecture, for two context durations, ±20 s and ±50 s (motivated by the results from the RF classifier). Results are reported only for the case with an averaging window duration of 3 s, since these were significantly better than with a longer 20 s. It is evident from the results that the CNN classifier performs better with a larger context duration of ±50 s, which provides a wide view of the pre- and post-boundary acoustic cues. Changes to the activation functions in the conv. and the fully-connected layer did not affect the F-score.

Discussion
The boundary detection performances of all the three approaches are consolidated in Table 6 for the best hyperparameter settings of each as determined in the previous sections. Given that the CNN classifier does not receive an explicit rhythm representation of the signal, we present in addition the other methods minus rhythm features. The supervised approaches, CNN and RF classifier, perform equally well while the unsupervised approach is clearly worse. With the low-dimensional rhythm features alone, the unsupervised system performs better than the supervised RF classifier (as we saw in Tables 3 and 4). The latter benefits only slightly in precision from the rhythm features. An interesting observation was that the novelty peaks from the rhythm features signaled the boundaries more distinctly than did the timbre features in certain instances where the latter were unreliable such as the male-female duet where the melody reset at the section boundary was obscured by the constant switching between the individual singers' ranges. Across concerts, false positives were seen when boundary-like cues such as the noom and mohra appeared within sections (i.e. without a tempo increase), signaling here changes at some lower level in the structural hierarchy. Missed detections were observed in instances where the tempo change between sections was relatively low. The interaction of multiple cues is evident overall.
We report next the performance of the above configured systems on two new concerts (outside the previous dataset and therefore not included in the systems' hyperparameter optimization). The concerts are further distinguished in that they are by a young, female Dhrupad vocalist, over 20 years apart from our dataset artists. The alaps, each about 20 minutes in duration, contain two boundaries each, demarcating three sections. In a departure from the vocal style of the previous artists, the vocalist, Pelva, intersperses the normally uttered Dhrupad syllables with the vowel 'a', thus reducing the clarity of the computed rhythmogram even though the tempo change at section transitions remains perceptually salient. She also sometimes omits the section change cue noom. The characteristic shifting of melodic focus across the section and its abrupt reset to the middle tonic are maintained, but with one case of a larger than usual separation in time between the tempo cue and melodic cue instants in the second concert. Finally, the recordings are marked by a relatively loud tambura (drone) background. Table 7 presents the results obtained on an augmented test set of 5 time-shifted audios corresponding to each concert. We observe that the RF classifier performs best, with performance similar to that obtained on the 20 concert dataset, indicating that the system generalizes well. This is not true for the CNN classifier, where the performance drops steeply, especially in the case of the second concert, affected probably by the above noted concert-specific variation. The unsupervised system exhibits a similar performance across the two concerts, although the F-score is slightly worse than that obtained on the previous 20 concert dataset.

Conclusion
The Dhrupad alap is a highly structured performance within an improvisational framework. Rhythm or tempo marks the evolution of the concert in time with abrupt changes at section boundaries. Within each section, melodic development plays out in a similar way with the gradual shifting of melodic focus starting from the concert tonic. The above musical cues were found to be effectively captured with acoustic features related to syllable rate and vocal brightness, both computable from the short-time magnitude spectrum representation of the audio recording. Thus, our work provides us with explicit descriptions of the audible structure of the alap, an important constituent of the listener's unconscious schematic expectations (Widdess, 2011). With the given training dataset, a supervised classifier trained on the hand-crafted features performed best overall. While the perceptually most distinguishing characteristic of the concert sections is the syllable rate or tempo, the more powerful cues in the automatic detection of boundaries were found to be the abrupt melodic transitions. More reliable means of detection of the vocal syllabic onsets can potentially lead to more robust rhythm features. A take-home message is therefore that it can be rewarding to investigate MIR methods on new dataset/task scenarios, both for achieving reasonable performance outcomes and for drawing a deeper understanding of genre characteristics from the acoustical analyses. While the CNN classifier performs competitively on the 20 concert dataset through a purely learned representation, it is more affected by concert-dependent variations that could be resolved with larger and more diverse training data. The possibility of learning features for structural boundary detection in an unsupervised manner is also an attractive prospect provided a sufficiently large genre-specific dataset can be assembled (McCallum, 2019).
Finally, applying the outcomes of this research to the concert summarization task where the important musicological cues to section character are preserved is an interesting topic for future work, together with its extension to the composition sections of the Dhrupad concert that are rendered after the alap (Ranganathan, 2013).

Reproducibility
All annotations, code and trained models are available at this link: https://github.com/DAP-Lab/dhrupadalap-segmentation. The annotations contain section boundaries and labels for all the concerts used in the cross-validation and test experiments. The concert audios are not made available, but links to their sources are provided.