
# Differentiable Short-Term Models for Efficient Online Learning and Prediction in Monophonic Music

## Abstract

As pieces of music are usually highly self-similar, online-learning short-term models are well-suited for musical sequence prediction tasks. Due to their simplicity and interpretability, Markov chains (MCs) are often used for such online learning, with Prediction by Partial Matching (PPM) being a more sophisticated variant of simple MCs. PPM, also used in the well-known IDyOM model, constitutes a variable-order MC that relies on exact matches between observed n-grams and weights more recent events higher than those further in the past. We argue that these assumptions are limiting and propose the Differentiable Short-Term Model (DSTM) that is not limited to exact matches of n-grams and can also learn the relative importance of events. During (offline-)training, the DSTM learns representations of n-grams that are useful for constructing fast weights (that resemble an MC transition matrix) in online learning of intra-opus pitch prediction. We propose two variants: the Discrete Code Short-Term Model and the Continuous Code Short-Term Model. We compare the models to different baselines on the “The Session” dataset and find, among other things, that the Continuous Code Short-Term Model has a better performance than Prediction by Partial Matching, as it adapts faster to changes in the data distribution. We perform an extensive evaluation of the models, and we discuss some analogies of DSTMs with linear transformers. The source code for model training and the experiments is available at https://github.com/muthissar/diffstm.

How to Cite: Bjare, M.R., Lattner, S. and Widmer, G., 2022. Differentiable Short-Term Models for Efficient Online Learning and Prediction in Monophonic Music. Transactions of the International Society for Music Information Retrieval, 5(1), pp.190–207. DOI: http://doi.org/10.5334/tismir.123
Published on 29 Nov 2022
Submitted on 05 Nov 2021 · Accepted on 12 Sep 2022

## 1. Introduction

Commonly used parametric predictive sequence models for music such as recurrent neural networks (RNNs) (Rumelhart et al., 1986; Eck and Schmidhuber, 2002), convolutional neural networks (CNNs) (LeCun et al., 1989; Yang et al., 2017) and Transformers (Vaswani et al., 2017; Huang et al., 2018) are trained on a corpus of compositions to learn the statistics of a particular musical style. Musical pieces are typically self-similar; knowing parts of a piece can facilitate predicting other parts of the same piece. Also, the statistics within a piece (i.e., intra-opus1) can differ substantially from the inter-opus statistics comprising a whole corpus. Models trained on inter-opus statistics (which we will refer to as long-term models (LTMs) throughout this article) therefore need to adapt to intra-opus statistics at test time. Different architectures succeed in this task to varying degrees, with (autoregressive) Transformers currently being the most successful. The main drawback of these autoregressive LTMs is that considering all the relevant past (i.e., temporal context2) for prediction is costly. With increasing dimensionality of the data (e.g., in audio), LTMs quickly become impractical to use.

A different approach to exploiting musical self-similarity for prediction is to use short-term models that learn intra-opus statistics in an online fashion (which we will refer to as short-term models (STMs) throughout this article). Such models are usually smaller than LTMs because they only need to learn the statistics of the (recent) past instead of an entire corpus. Also, standard autoregressive models need to receive all relevant temporal context as input for every single prediction. In contrast, online learners can store some of that context in their inner state. That way, the required context length can be reduced which also reduces the computational cost.

Well-known examples of such STMs are Markov chains (MCs) (used as early statistical models of music by Quastler (1955); Pinkerton (1956); Brooks et al. (1957)), where online learning can be performed by simply counting the frequencies of the observed state transitions. However, common MCs are limited in different aspects. First, they can only operate on univariate discrete state spaces, which prevents a generalization to continuous data and makes it impractical, for example, to model (multivariate) polyphonic music. Second, they rely on exact n-gram matching and ignore potentially informative contexts that are similar but not identical to already encountered contexts. Third, they suffer from the so-called zero-frequency problem (Roberts, 1982), leading to invalid predictions when observing a specific context for the first time.

Variable-order MCs have been proposed to mitigate some of the challenges of fixed-order MCs by combining the predictions of various fixed-order MCs using sophisticated backoff-strategies and model weighting. However, these backoff-strategies impose a priori data assumptions, for example that more recent events are supposed to be more informative for prediction (see Section 2). Due to such assumptions, the performance of variable-order MCs may vary on different datasets (Witten and Bell, 1991).

An example of a variable-order MC is Prediction by Partial Matching (PPM) (Cleary and Witten, 1984) used in Information Dynamics of Music (IDyOM) (Conklin and Witten, 1995; Pearce, 2005), a model of auditory cognition that aims to simulate the expectations of human listeners. IDyOM operates on so-called viewpoints, where the musical surface is represented by multiple handcrafted features. Also, in IDyOM two PPM models are used: the first acts as an STM for learning the intra-opus distribution of a specific piece, and the second serves as an LTM, intended to learn the inter-opus distribution of the entire corpus. The outputs of the two models are then combined to obtain the final predictions. For IDyOM, STMs are important constituents, as they simulate the online learning of humans when listening to a song. The goal of this work is to develop an improved version of an STM that can eventually be used in more complex musical contexts by learning relevant viewpoints for musical event prediction. That way, the IDyOM line of research could be extended to more general music representations, like polyphonic music or latent representations of audio.

To that end, we introduce a fully differentiable connectionist MC architecture that naturally generalizes to variable-order contexts and reduces the zero-frequency problem through approximate context matching. Instead of using observed discrete context and target spaces for constructing an MC transition matrix, we learn the context space using an encoder while using an observed discrete pitch space as the target space. The proposed approach is a form of meta-learning, as the model learns context representations optimized for online Markovian learning. That way, the resulting representations can be used to effectively compute a fast weight transition matrix analogous to the transition matrix of a common MC. As a result, we obtain a connectionist STM that can be trained in an end-to-end fashion. Even though the architecture is still limited to a univariate discrete target space, the context space could also be learned from continuous multivariate data. Altogether, the model has some critical advantages over common Markov chains, lifting most of the limitations mentioned above.

The article is structured as follows. In Section 2, we briefly review the workings of PPM. In Section 3 we review how MCs fitted with maximum likelihood (ML) estimation may be used as an online learning sequence model. Inspired by MCs, we propose Differentiable Short-Term Models (DSTMs): online learning STMs that make use of a context encoder that learns generalized context representations.

We introduce two variants: a Discrete Code Short-Term Model (DCSTM) and a Continuous Code Short-Term Model (CCSTM). As an intermediate result, we show that fixed-order MCs, DCSTMs and CCSTMs can be viewed as forms of linear Transformers (Katharopoulos et al., 2020). In Section 6 we test our models on the monophonic “The Session” dataset (Sturm et al., 2016) and find that our CCSTM has a better short-term performance than PPM. Our results also suggest that the CCSTM better handles the introduction of novel material (better adapts to changes in the latent data distribution) than PPM. We show that, after the intra-opus distribution has been established, the CCSTM performs better than a comparable LTM. Furthermore, we show that the CCSTM suffers less from a reduction of the (temporal) receptive field than a Transformer model because of its external memory. Finally, we show that the learned context features, like PPM, are sensitive to the more recent past and, unlike PPM, to temporal lags, which are essential to the musical form.

## 2. Prediction by Partial Matching

We contrast our work to the PPM model, which, like our models, learns (n+1)-gram probabilities of a sequence online. PPM has been applied extensively to the domain of music in the IDyOM model to analyze aspects of perception of music, including pitch expectations (Egermann et al., 2013), similarity perception (Pearce and Müllensiefen, 2017) and phrase boundary perception (Pearce et al., 2010), to name a few.

To perform predictions at time t, PPM matches the current context of the last n observed events (which we will refer to as a query from now on), with already observed n-grams (referred to as keys in the following). In order to predict, PPM initially calculates ML estimates of nth-order MCs (Section 3.1) at various orders. The ML estimate is given by the relative frequency of events following the matching keys (we refer to the events following keys as values).3 PPM finally predicts using a weighting of these nth-order MCs. Using a weighting of predictions of different orders addresses a trade-off between model order and data scarcity (Section 3.2), which is an inherent problem of finite fixed-order MCs, and is, in particular, a challenge in the intra-opus scenario where the amount of training data is limited. A central limitation of PPM is that it relies on exact matches between keys and queries. For example, consider the first 8 bars of the composition “250 to Vigo (sessiontune9)” from “The Session” dataset shown in Figure 1. The query actually represents a variation of the key and precedes the same target note. However, as the variation occurs on the most recent notes, PPM would fail to match the key and query at any order. From a music-theoretical viewpoint, this implicit ordering of importance according to recency is not always natural. For example, the pitches of notes on strong beats (e.g., “down beats”) might be more informative for deriving the tonality of a piece than the notes on weak beats. Manually defining such music-theoretical rules seems unsatisfying, considering the vast number of possible ways in which compositions can unfold (even though this has been done by using multiple viewpoints in the work of Conklin and Witten (1995)).

Figure 1

First 8 bars of tune “250 to Vigo (sessiontune9)” from “The Session” dataset. The figure shows a key similar to the query, both of which are followed by the same pitch E. PPM would fail to match the key and query since the key and query do not match at any order n.

## 3. Differentiable Short-Term Models

In this section, we introduce the two DSTM variants: the Discrete Code Short-Term Model (DCSTM) and the Continuous Code Short-Term Model (CCSTM). The models are based on maximum likelihood (ML) estimation of transition probabilities in fixed-order Markov chains (MCs). In Section 3.1 we review how fixed-order MCs can be used to formulate a predictive model which learns to predict pitches intra-opus. In Section 3.2 we describe a predictive model that uses discrete context encodings. In Section 3.3 we interpret these models as forms of linear Transformers, and in Section 3.4 we define a model that uses real-valued (i.e., continuous) context encodings.

### 3.1 Online Learning nth-Order Markov Chains

The data used in this work are melodies in monophonic piano roll representation (a time sequence of symbols represented as one-hot encoded vectors, where an on-bit represents a note that is active at the corresponding time step, see Figure 3). Note durations are represented by activating the corresponding bit for several time steps (i.e., from onset to offset). Note that with this representation, two successive notes of equal pitch are necessarily fused together. We use a vocabulary of size d consisting of pitch symbols and two special symbols, one representing a rest and one used for sequence padding. A melody of N symbols can then be modelled by the ML estimated nth-order (discrete state) time-homogeneous MC (see Section 2).
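To make the representation concrete, the encoding described above can be sketched in a few lines (a minimal numpy sketch; the index layout of the rest and padding symbols is a hypothetical choice, not necessarily the authors' implementation):

```python
import numpy as np

# Hypothetical vocabulary layout: 48 MIDI pitches (48-95, cf. Section 5.1),
# a rest symbol, and a padding symbol, giving a vocabulary of size d = 50.
PITCHES = list(range(48, 96))
REST, PAD = len(PITCHES), len(PITCHES) + 1
d = len(PITCHES) + 2

def one_hot(index, dim):
    v = np.zeros(dim)
    v[index] = 1.0
    return v

def encode_note(pitch, n_steps):
    """Hold a pitch for n_steps 1/16th-note time steps (onset to offset)."""
    idx = PITCHES.index(pitch) if pitch in PITCHES else REST
    return np.tile(one_hot(idx, d), (n_steps, 1))

# A quarter note (four sixteenths) of MIDI pitch 60, then a sixteenth rest:
roll = np.vstack([encode_note(60, 4), encode_note(None, 1)])
```

Note that, as stated above, holding a bit across steps means two successive notes of equal pitch fuse into one.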

In order to calculate the ML estimate, we collect frequency counts of (n+1)-grams (i.e., key/value pairs), observed until time t in a d × dn frequency matrix Wt. In the frequency matrix, columns correspond to observed keys, and rows correspond to observed values. The frequency matrix is calculated using one-hot encodings of keys and corresponding values. Let s0, …, sN–1 be a sequence of one-hot encoded symbols in a d dimensional space representing our composition. We prepend our composition with n padding symbols, thus obtaining the sequence s–n, …, sN–1. Using a bijective one-hot encoding function g we one-hot encode the keys by g(st–n+1, …, st) = ht, i.e., g maps any distinct key to a unique one-hot encoded vector in a dn-dimensional space. The frequency matrix Wt is calculated by a sum of outer-products:

(1)
${W}_{t}={W}_{t-1}+{s}_{t-1}{h}_{t-2}^{\top }={W}_{0}+\sum _{{t}^{\prime }=0}^{t-1}{s}_{{t}^{\prime }}{h}_{{t}^{\prime }-1}^{\top }$

since for each term in the sum

(2)
$\left({s}_{{t}^{\prime }}{h}_{{t}^{\prime }-1}^{\top }\right)_{i,j}={\left({s}_{{t}^{\prime }}\right)}_{i}{\left({h}_{{t}^{\prime }-1}\right)}_{j}=\begin{cases}1& \text{if }{\left({s}_{{t}^{\prime }}\right)}_{i}={\left({h}_{{t}^{\prime }-1}\right)}_{j}=1,\\ 0& \text{otherwise.}\end{cases}$
Therefore, each element of the frequency matrix (Wt)i,j identifies the observed frequency of a distinct key/value pair. W0 encodes prior (unnormalized) transition probabilities, which are discussed in Section 3.7. By projecting the query ht–1 onto the frequency matrix (calculating Wtht–1), we obtain the frequencies of (n+1)-grams where the keys match the query ht–1, and by normalizing we obtain an ML estimate of transition probabilities conditioned on ht–1:

(3)
$\begin{array}{ll}{P}_{t}\left({\stackrel{^}{s}}_{t}|{h}_{t-1}\right)\hfill & ={P}_{t}\left({\stackrel{^}{s}}_{t}|{s}_{t-n},\dots ,{s}_{t-1}\right)\hfill \\ \hfill & ={W}_{t}{h}_{t-1}/‖{W}_{t}{h}_{t-1}{‖}_{1}.\hfill \end{array}$

In Section 3.7 we define W0 in a way such that Equation (3) is always well-defined (the projection is non-zero), however, for now we assume that W0 = 0 to ease notation. We emphasize that the bijective function g transforms the nth-order MC: s–n, …, sN–1 into an equivalent 1st-order MC: h–1, …, hN–1.
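For illustration, the online update of Equation (1) and the prediction of Equation (3) can be written out for a toy first-order case (n = 1, so ht = st; a three-symbol alphabet; a sketch, not the authors' implementation):

```python
import numpy as np

d = 3
seq = [0, 1, 0, 1, 0]          # toy symbol indices
eye = np.eye(d)

W = np.zeros((d, d))           # frequency matrix, W_0 = 0
for t in range(1, len(seq)):
    # Equation (1): accumulate the outer product s_t h_{t-1}^T
    W += np.outer(eye[seq[t]], eye[seq[t - 1]])

# Equation (3): project the query onto W and L1-normalize
query = eye[0]                 # context "symbol 0"
probs = W @ query / np.abs(W @ query).sum()
```

Here symbol 0 was always followed by symbol 1, so the estimate assigns probability one to symbol 1.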

### 3.2 Discrete Code Short-Term Model

A well-known theoretical result is that the uncertainty of a random variable does not increase while extending the set of conditioning variables, or in our case while increasing the context length (Cover and Thomas, 2005). As an effect, higher-order MCs tend to have higher predictive performance. However, when calculating an ML estimate of the transition probabilities, if the context length is increased, the frequencies of (n+1)-grams tend to be distributed among more keys. These low-frequency estimates could lead to poorly estimated transition probabilities. We refer to this as the model-order data scarcity trade-off. As illustrated in Figure 1 some keys and queries, which do not match exactly, might share structures that cause their predictions to be similar. Our goal is to replace g, which treats all non-identical n-grams as distinct, with a non-injective one-hot encoder network gθ with image dimension (number of discrete context encodings) m ≪ dn. The encoder should map keys and queries to the same code when they lead to the same predictions in a training corpus. Therefore, the encoder addresses the data scarcity trade-off of fixed-order MCs by allowing the model to operate at higher orders (consider a longer past) while retaining higher (n+1)-gram frequencies in the ML estimate due to the reduced state space. We emphasize that the 1st-order MC: h–1, …, hN–1, obtained from gθ, is not equivalent to the nth-order MC: s–n, …, sN–1.

In Section 3.2.1 we describe how the encoder network can be defined in a differentiable manner. The encodings can then be learned offline on a training set through likelihood optimization of Equation (3) (see Section 3.5). When optimized over all time steps of a sequence, the online learning model should predict efficiently on the short- and long-term (for low and high values of t).

#### 3.2.1 Encoder Network

We use an encoding given by:

(4)
$\begin{array}{ll}{h}_{{t}^{\prime }-1}\hfill & ={g}_{\theta }\left({s}_{{t}^{\prime }-n},\dots ,{s}_{{t}^{\prime }-1}\right)\hfill \\ \hfill & =\text{GumbelSoftmax}\left(\hfill \\ \hfill & {\text{AutoregressiveNN}}_{\theta }\left({s}_{{t}^{\prime }-n},\dots ,{s}_{{t}^{\prime }-1}\right)\hfill \\ \hfill & \right),\hfill \end{array}$

where AutoregressiveNN is an autoregressive neural network that summarizes the context in a single vector (see Section 3.6) and GumbelSoftmax (Jang et al., 2017; Maddison et al., 2017) is a function that outputs approximately one-hot encoded samples from a categorical distribution. The output samples are furthermore differentiable w.r.t. the input, and therefore Equation (4) is differentiable w.r.t. its parameters θ. We hypothesize that using a categorical distribution might benefit early training to take advantage of all m discrete codes. We furthermore hypothesize that towards the end of the training, the distribution will tend to be deterministic.

GumbelSoftmax is based on the GumbelMax technique (Maddison et al., 2014), used to sample one-hot encoded categorical variables z with class probabilities πj for j = 1, 2, …, m and is calculated by:

(5)
$z=\text{onehot}\left(\underset{j}{\mathrm{argmax}}\left(\mathrm{log}\left({\pi }_{j}\right)+{e}_{j}\right)\right),$
where ej is distributed according to the Gumbel(0, 1) distribution (Gumbel, 1935). Due to the argmax function, Equation (5) is not differentiable. To ensure differentiability, GumbelSoftmax replaces the argmax function with a Softmax function:

(6)
${h}_{j}=\frac{\mathrm{exp}\left(\left(\mathrm{log}\left({\pi }_{j}\right)+{e}_{j}\right)/\tau \right)}{\sum _{k=1}^{m}\mathrm{exp}\left(\left(\mathrm{log}\left({\pi }_{k}\right)+{e}_{k}\right)/\tau \right)},\phantom{\rule{1em}{0ex}}j=1,\dots ,m,$
where τ is a temperature determining the sharpness of the distribution. As τ → 0, the distribution of h approaches the distribution of z, i.e., a categorical (one-hot encoded) distribution with class probabilities πj. For τ > 0, Equation (6) is differentiable with respect to πj, and can therefore be incorporated in a neural network by providing the log-probabilities log(πj) as outputs of a network.

The approximation error of Equation (6) relative to Equation (5) can be made arbitrarily small by lowering τ. However, lower values of τ simultaneously increase the variance of the gradient. Jang et al. (2017) address this problem by annealing of τ, where initial learning has low gradient variance, and over time, the approximation of h to z improves. While samples obtained by low values of τ better approximate Equation (5), they are still not exactly one-hot encoded. In this work, we use discrete samples ht to calculate the frequency matrix Wt by Equation (1). Equations (1) and (3) are, however, well-defined when ht is a GumbelSoftmax sample. We might therefore consider ht as a one-hot encoded vector plus a noise vector, where the magnitude of the noise vector can be controlled by τ. Alternatively Jang et al. (2017) define a “straight-through” training technique, where Equation (5) is used in forward-passes, whereas Equation (6) is used for backward-passes. Straight-through training ensures that samples used in Equations (1) and (3) are discrete, but this is at the expense of a biased gradient estimator. Lastly, we can obtain deterministic samples simply by selecting the largest πj.
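A numpy sketch of Equations (5) and (6) for illustration (PyTorch also provides torch.nn.functional.gumbel_softmax with a straight-through "hard" option; the sketch below is a toy stand-in, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(log_pi, tau):
    """Relaxed categorical sample, Equation (6)."""
    e = rng.gumbel(0.0, 1.0, size=log_pi.shape)   # Gumbel(0, 1) noise
    y = (log_pi + e) / tau
    y = np.exp(y - y.max())                       # numerically stable softmax
    return y / y.sum()

log_pi = np.log(np.array([0.7, 0.2, 0.1]))
soft = gumbel_softmax(log_pi, tau=0.8)            # approximately one-hot
hard = np.eye(3)[soft.argmax()]                   # discrete sample (cf. Equation (5))
```

Lowering tau sharpens `soft` toward `hard`, at the price of a higher-variance gradient in the differentiable setting.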

### 3.3 Interpretation of Markov Chains as Linear Transformers

As shown in Appendix A.1, for W0 = 0, the transition probabilities (see Equation (3)) can be rearranged into:

(7)
${P}_{t}\left({\stackrel{^}{s}}_{t}|{h}_{t-1}\right)=\sum _{{t}^{\prime }=0}^{t-1}a\left({h}_{{t}^{\prime }-1},{h}_{t-1}\right){s}_{{t}^{\prime }},$

where

(8)
$a\left({h}_{{t}^{\prime }-1},{h}_{t-1}\right)=\frac{{h}_{{t}^{\prime }-1}^{\top }{h}_{t-1}}{\sum _{{t}^{″}=0}^{t-1}{h}_{{t}^{″}-1}^{\top }{h}_{t-1}}.$

With this reformulation, we can interpret the ML estimate of transition probabilities as a causal self-attention procedure where our terminology for key (ht′–1), query (ht–1) and value (st′) follows the ordinary self-attention terminology (Vaswani et al., 2017). Therefore, Equations (7) and (8) show that a fixed-order ML estimated MC and the DCSTM are forms of linear Transformers (Katharopoulos et al., 2020) where the similarity between key code and query code is binary (i.e., either the key and query match, or they do not match):

(9)
$\mathrm{sim}\left({h}_{{t}^{\prime }-1},{h}_{t-1}\right)={h}_{{t}^{\prime }-1}^{\top }{h}_{t-1}=\begin{cases}1& \text{if }{h}_{{t}^{\prime }-1}={h}_{t-1},\\ 0& \text{otherwise.}\end{cases}$
For the case where W0 ≠ 0 please refer to Appendix A.1.
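The rearrangement can be checked numerically for W0 = 0: the MC estimate of Equation (3) and the attention form of Equations (7) and (8) coincide (toy dimensions, random one-hot codes):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, T = 4, 5, 10
values = np.eye(d)[rng.integers(0, d, T)]   # one-hot values s_{t'}
keys = np.eye(m)[rng.integers(0, m, T)]     # one-hot codes h_{t'-1}
query = keys[-1]                            # reuse an observed code as query

# Equation (3): frequency matrix, projection, L1 normalization
W = sum(np.outer(values[t], keys[t]) for t in range(T))
mc_probs = W @ query / np.abs(W @ query).sum()

# Equations (7) and (8): attention over binary key/query similarities
sims = keys @ query
attn_probs = (sims / sims.sum()) @ values
```

Both routes produce identical transition probabilities; the attention view merely re-groups the same sums.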

### 3.4 Continuous Code Short-Term Model

During prediction, the DCSTM matches the encoded query with encoded keys in a binary similarity condition (see Equation (9)). If learning is not optimal, it might happen that a key st′–n, …, st′–1 and a query st–n, …, st–1, which statistically predict the same value intra-opus, are assigned to different codes (i.e., ht′–1 ≠ ht–1). In this case, the observation ht′–1, st′ is omitted from the ML estimate (see Equation (3)), which lowers the predictive performance. In the CCSTM, we generalize our model by replacing the one-hot encoder of the DCSTM with an encoder with differentiable non-negative outputs. Now, the similarity measure of the CCSTM resembles the ordinary linear self-attention similarity measure yielding a non-negative real number, where higher values indicate that the key and the query are more similar. This also ensures that all observations are used for prediction, i.e., the prediction is a weighted mean of all observed values, where the weight of each value is determined by the similarity between the corresponding key code and the query code.

#### 3.4.1 Encoder Network

We use an encoder

(10)
$\begin{array}{l}{h}_{{t}^{\prime }-1}={g}_{\theta }\left({s}_{{t}^{\prime }-n},\dots ,{s}_{{t}^{\prime }-1}\right)\\ =\varphi \left({\text{AutoregressiveNN}}_{\theta }\left({s}_{{t}^{\prime }-n},\dots ,{s}_{{t}^{\prime }-1}\right)\right),\end{array}$

where the activation φ is chosen as

(11)
$\varphi \left(x\right)=\text{elu}\left(x\right)+1,$

like in the work of Katharopoulos et al. (2020). However, we note that more recently, other activation functions have been proposed for this task (Choromanski et al., 2021; Schlag et al., 2021).
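A sketch of the CCSTM similarity computation with the feature map of Equation (11); toy dimensions, with random vectors standing in for encoder outputs:

```python
import numpy as np

def phi(x):
    """elu(x) + 1 (Equation (11)): strictly positive outputs."""
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(2)
d, m, T = 4, 6, 12
codes = phi(rng.standard_normal((T, m)))    # continuous key codes h_{t'-1}
values = np.eye(d)[rng.integers(0, d, T)]   # one-hot values s_{t'}
query = phi(rng.standard_normal(m))         # query code h_{t-1}

sims = codes @ query                        # strictly positive similarities
probs = (sims / sims.sum()) @ values        # weighted mean of all observed values
```

Because every similarity is positive, no observation is ever discarded, in contrast to the binary matching of the DCSTM.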

### 3.5 Learning Codes

The encodings can be learned offline using ordinary gradient descent techniques on mini-batches of size B. Let Nb be the length of sequence b. The (SGD) objective is given by:

(12)
$\underset{\theta }{\mathrm{min}}-\sum _{b=1}^{B}\sum _{t=0}^{{N}_{b}-1}\mathrm{log}\left({\stackrel{^}{s}}_{b,t}^{\top }{s}_{b,t}\right),$

where ŝb,t follows Equation (3), applied to sequence b.
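Written out, the objective of Equation (12) is the cross-entropy between the online predictions and the one-hot targets (toy probabilities for two time steps of a single sequence):

```python
import numpy as np

preds = np.array([[0.7, 0.2, 0.1],     # s_hat_{b,t} from Equation (3) (toy values)
                  [0.1, 0.8, 0.1]])
targets = np.eye(3)[[0, 1]]            # true one-hot symbols s_{b,t}

# The inner product s_hat^T s picks out the probability of the true symbol.
nll = -np.sum(np.log(np.sum(preds * targets, axis=1)))
```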

### 3.6 Autoregressive Neural Network

For the autoregressive neural network (Equations (4) and (10)), any sequence model could be used, e.g., an RNN, temporal convolutional neural network, or autoregressive Transformer. In this work, we use a (time-)dilated causal CNN due to its efficiency in covering large receptive fields.

#### 3.6.1 Dilated Causal Convolution

Dilated causal convolution (van den Oord et al., 2016) performs convolution in time, thereby summarizing a context of previously observed sequence elements in a vector. Dilated causal convolution uses convolutional layers l ∈ {1, …, L} to achieve an effective receptive field of length n = 2^L by using a convolutional dilation of d_l = 2^(l–1) in each layer. Therefore, the receptive field scales exponentially with the number of convolutional layers, and dilated causal convolution can be seen as an efficient, highly non-linear variant of 1-D convolution with receptive field n. In the input layer, l = 1, we use filters in ℝd×2 to cover all pitches of the piano roll. In subsequent layers, l > 1, we use filters in ℝ1×2. Causality is achieved by zero-padding in all layers. In the output layer, we use m filters to obtain an m-dimensional code vector.
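The receptive-field arithmetic can be verified directly: with kernel size 2 and dilation 2^(l–1) in layer l, a stack of L layers covers 2^L time steps.

```python
def receptive_field(num_layers, kernel_size=2):
    """Receptive field of a stack of dilated causal convolutions."""
    n = 1
    for l in range(1, num_layers + 1):
        dilation = 2 ** (l - 1)
        n += (kernel_size - 1) * dilation
    return n
```

For the L = 9 network used in Section 5.3 this gives n = 512, i.e., 32 measures of sixteenths in 4/4.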

### 3.7 Zero-Frequency Problem and Prior Probabilities

If the frequency matrix at time t = 0 is zero (i.e., W0 = 0), the predictions are exclusively based on the observed (n+1)-grams within a piece. For the DCSTM, W0 = 0 also implies that whenever a query code does not match any observed key code, then the transition probabilities are undefined (i.e., the zero-frequency problem occurs). For the CCSTM, the similarities of codes and query will, in practice, be non-zero and the zero-frequency problem only occurs at t = 0. However, if the frequency matrix at time t = 0 is non-zero, then it encodes prior (unnormalized) transition probabilities. Therefore, the “learning rate” of intra-opus transition probabilities can be controlled by scaling the prior frequency matrix (see Appendix A.2). In this study, we choose as an initialization scheme a uniform prior probability and a high learning rate. This implies that our model starts the intra-opus online-learning process without any prior knowledge. This is desired because we want to evaluate our model exclusively on learned intra-opus statistics (i.e., as a short-term model). Other possible initialization schemes that could provide the model with more inter-opus information are 1) initialize the frequency matrix by an aggregated frequency matrix summed over all training pieces (i.e., letting the model “listen” to the corpus after training the encoder), and 2) parameterization of the prior frequency matrix and learning of the parameters at train time.

Note that the procedure of point 1) above could also be performed for only a limited set of pieces. Such a priming enables flexible control of prior knowledge accessible to the model. Priming, as a form of data adaptation, does not require costly stochastic gradient descent and could be used in music production, where artists could quickly adapt a model to imitate their music style. Priming could also be advantageous for simulating human listening processes, where a listener model could be primed by some pieces that should be familiar to a specific human listener. Besides its usage at test time, priming could also be employed at (meta-learning) train time, similar to using support sets in K-shot learning. Thereby, the priming data would influence the context encodings as the model would need to perform well on several pieces instead of only on one piece.
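The uniform-prior initialization discussed above can be sketched as follows; the scale alpha is a hypothetical knob for the "learning rate" trade-off (small alpha lets intra-opus counts dominate the prior quickly):

```python
import numpy as np

d, m = 4, 6
alpha = 0.01                      # small prior mass: high intra-opus learning rate
W0 = alpha * np.ones((d, m))      # uniform prior; Equation (3) is always defined

# Before any observation, every query yields the uniform distribution:
query = np.eye(m)[0]
prior_probs = W0 @ query / np.abs(W0 @ query).sum()
```

Priming would instead initialize W0 with (scaled) frequency counts aggregated from a chosen set of pieces.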

## 4. Similarity to Other Work

Pearce (2005) uses the nonparametric online learning PPM model to simulate a listening process by adjusting n-gram probabilities during test time. We instead use a parametric model (artificial neural network (ANN)), which learns context representations offline that are used by an online learning STM. The approach could therefore be regarded as meta-learning. In our setup, using a meta-learning methodology similar to Santoro et al. (2016), each composition gives rise to a meta-learning task or episode. In an episode, the model needs to predict pitch st for all t = 0, …, N–1 using a support sequence of previous pitches smax(0,t–n), …, st–1. Santoro et al. (2016) use a form of Neural Turing Machine (Graves et al., 2014) with an external memory where episode-specific information can be stored in an addressable way. This is similar to how our model stores episode-specific information in the frequency matrix. Our architecture is closely related to meta-learning Matching Networks (Vinyals et al., 2016) proposed for K-shot learning (cf. Equations (7) and (8)). Vinyals et al. (2016) predict according to Equation (7), but use a non-linear attention kernel. Unlike Santoro et al. (2016) and our setup, where the relative order of the support sequence is essential, K-shot learning uses an unordered support set of size K. Our approach differs from the approach of Vinyals et al. (2016) by using a linear attention kernel rather than a non-linear attention kernel. This should enable the learning of longer time dependencies (Katharopoulos et al., 2020). Using the interpretation of our STM as a linear attention kernel (Section 3.4), the STM can be viewed as a linear Transformer decoder layer (Katharopoulos et al., 2020), where keys and queries are transformed using gθ and values are transformed using an identity transform.
This implies that techniques from Transformers can be used to improve our models, such as alleviating the time-homogeneity assumption by using relative position representations (Shaw et al., 2018; Huang et al., 2018; Liutkus et al., 2021). However, Equation (3) might also be viewed as a specific fast-weight model (Schmidhuber, 1992) where the slow-weight network (gθ) is used to generate fast-weights (Wt). The model might therefore be extended with other update rules developed for fast-weight models than the simple additive rule (Equation (1)). This includes learning an “override strength” (Schlag et al., 2021) which could make the model adapt well to non-stationary distributions. Lastly, in the discrete setting (DCSTM), we might consider our approach a state-space dimensionality reduction algorithm (Barr and Thomas, 1977; Katehakis and Smit, 2012; Kannan et al., 2020) that maximizes predictive performances of MCs fitted to a reduced state space using ML estimation.

## 5. Experiments

In this section, we describe the setup used in our experiments. In Section 5.1, we introduce the training and test data. In Section 5.2, the employed metrics are discussed. In Section 5.3, we describe the architecture and training details. Finally, in Section 5.4 we introduce the baseline models.

### 5.1 Data

We use lead sheets of Irish folk music in ABC notation4 contributed by the online community “The Session”.5 Specifically, we use the pre-processed MIDI re-encoded dataset of Sturm et al. (2016). The dataset is especially suited for our task as it comprises many different monophonic folk pieces (45849 pieces, mostly in 4/4, 3/4, and 6/8 time signatures) with a simple structure often containing repeated sections. This provides well-posed conditions for learning context encodings that are suited for intra-opus online learning. The dataset contains MIDI pitches in the range pitch = 48, 49, …, 95. We use pitch = 47 to indicate a rest. We partition the dataset into training, validation, and test sets with proportions 10/12, 1/12, 1/12, respectively. The dataset contains pieces with the same name, corresponding to different versions of the same tune. We ensure that tunes with the same name appear in exactly one set. We then construct piano rolls with a time resolution of 1/16th-notes.

### 5.2 Metrics

We compare the performance of our models using negative log-likelihood (NLL) (i.e., cross-entropy) and precision, measured as the ratio of correct predictions. When computing the NLL, we add a constant of 0.001 for events that have zero probability.

### 5.3 Architecture Details and Training

We implemented the models as described in Sections 3.2 and 3.4 using the auto-differentiation tool PyTorch (Paszke et al., 2019). Contrary to the gated activation units used by van den Oord et al. (2016), we use the SELU non-linearity (Klambauer et al., 2017) and a dropout (Srivastava et al., 2014) of 0.1 in all layers. We employ a rectangular-shaped convolutional network (i.e., 512 filters in all layers) with L = 9 layers, resulting in a receptive field of n = 512, or 32 measures in 4/4 time signature. The models were trained using the ADAM optimizer (Kingma and Ba, 2015) with gradient clipping, a learning rate of η = 10–4, and a batch size of 16 pieces. We assumed that the models converged if no improvement was achieved for three successive epochs.

**GumbelSoftmax.** For the DCSTM, we found that “straight-through” training (see Section 3.2.1) led to poor results. Therefore, we used GumbelSoftmax (see Equation (6)) during training, which yields approximately one-hot encoded samples. For evaluation, we compute deterministic one-hot samples by setting the maximum value to one and the others to zero. We employ the following linear annealing scheme during training:

$\tau = \max\left(0.8,\; -9.5\cdot {10}^{-5}\cdot \mathit{step} + 1.5\right)$ (13)

where step is the gradient step number.
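The annealing schedule of Equation (13) and the deterministic evaluation-time discretization can be sketched as below; the function names are our own, and during training the temperature would be passed to a GumbelSoftmax sampler (e.g., PyTorch's `torch.nn.functional.gumbel_softmax`).

```python
import numpy as np

def tau(step):
    """GumbelSoftmax temperature, annealed linearly per Equation (13):
    starts at 1.5 and is clamped to a minimum of 0.8."""
    return max(0.8, -9.5e-5 * step + 1.5)

def hard_onehot(code):
    """Deterministic evaluation-time code: set the maximum entry to one
    and all others to zero, as described in the text."""
    out = np.zeros_like(code)
    out[np.argmax(code)] = 1.0
    return out
```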

### 5.4 Baselines

We compare our online learning models DCSTM and CCSTM against the online learning baselines: 1) a 3rd-order MC (see Section 3.1), 2) the PPM model (see Section 2), 3) the IDyOM model in short-term model setting with optimal viewpoint selection (Conklin and Witten, 1995) and 4) a model that predicts the last note (Repetition). Details on the baseline settings are outlined in Appendix B. We also compare our models to LTMs: firstly, a WaveNet-style CNN (WaveNet-512, see Section 3.6), where the output of the network is used with softmax to predict pitches; and secondly, a Transformer with relative positional encoding (Shaw et al., 2018; Huang et al., 2018) (Transformer-512), due to the resemblance of our attention mechanism to the self-attention mechanism of autoregressive Transformer models (see Section 4). To keep comparisons fair, we ensure that the LTMs have the same receptive field (n = 512) as our DSTMs and that the models have approximately the same number of parameters (measured in bytes). This results in a Transformer with 21 layers. Finally, we also test the effect of the receptive field size. To that end, we train a CCSTM and Transformer with a small receptive field n = 32, corresponding to 2 measures of 4/4 time signature. The corresponding models are reported as CCSTM-32 and Transformer-32, the latter model having ten layers.

## 6. Results and Discussion

In this section, we present and discuss the results of the model evaluations. The main evaluations consist of computing the precision and the NLL of the proposed models and the baselines. In Section 6.1 we provide the overall predictive performances. In Section 6.2, we show how the model performances vary over time within the pieces. To gain insights into the operations of the DSTMs, we perform additional evaluations on those models. In Section 6.3, a qualitative analysis is given, showing correct and incorrect predictions within a single piece. In Section 6.4 we present and discuss confusion matrices of the proposed models. A sensitivity analysis is provided in Section 6.5, and in Section 6.6, we analyze how contexts are assigned to codes and how codes are re-used in different pieces. Furthermore, in Section 6.7 and Section 6.8, we present the pitch- and time-signature imbalances of the dataset and contrast them with the model performances.

### 6.1 Mean Performance

The mean performances (averaged over the whole test set) of our DSTMs are summarized in Table 1, along with the performances of our baselines. Note that IDyOM does not use piano roll representations but multiple viewpoint representations that encode note durations using inter-onset intervals. Therefore, its performance is not directly comparable with our models. To make the models comparable, we consider in the CCSTM-512 and in IDyOM only those time steps (and notes, respectively) where the pitch changes (see Table 1, EVT). The CCSTM-512 outperformed all other STMs, but performed worse than the WaveNet-512 and the Transformer-512. As elaborated in Section 6.2.2, the WaveNet-512 only outperforms the CCSTM-512 at the beginning of pieces, where the CCSTM-512 could not yet adapt to the intra-opus statistics (see Figure 2). The Transformer-512 outperforms the CCSTM-512 consistently. However, for the variants with short receptive fields, the CCSTM-32 model outperforms the Transformer-32. An explanation for this could be the nature of the problem (i.e., learning data with a clear, repetitive structure): for many time steps, the problem resembles a copy task. On such a task, a model can perform well if the to-be-copied data is contained in its receptive field (as for the Transformer-512). Otherwise (as for the Transformer-32), the STMs perform better simply because the fast weights contain information about events before those contained within the receptive field. The DCSTM-512 performs considerably worse than the CCSTM-512 but has similar performance to the PPM model, with slightly better NLL but worse precision. The inferior performance of the DCSTM-512, as opposed to the CCSTM-512, is mainly caused by the zero-frequency problem (see Section 3.7), which does not affect the CCSTM-512 to the same extent.
Furthermore, the continuous code vectors can have multiple non-zero elements, which allow the encodings to be composed of sub-features and, when combined with the similarity measure, give a more expressive mechanism for modeling (dis-)similarities between contexts. The CCSTM-512 outperforms IDyOM considerably. Hence, even extending PPM with carefully selected input features does not suffice to outperform the CCSTM-512. In the following, we evaluate only piano-roll-based models.

Table 1

NLL and precision of the DCSTM, the CCSTM (our models), and the baselines, categorized in short-term (STM) and long-term (LTM) models. The length of the respective temporal context is denoted by n. * denotes that the maximal context is used. EVT means that the performance was measured only at time steps where the pitch changes.

| Type | Name | n | NLL | Precision |
|------|------|---|-----|-----------|
| STM | CCSTM-512 | 512 | 0.574 | 0.848 |
| | CCSTM-32 | 32 | 0.733 | 0.783 |
| | DCSTM-512 | 512 | 0.792 | 0.781 |
| | MC-3 | 3 | 1.922 | 0.606 |
| | PPM | * | 1.387 | 0.798 |
| | Repetition | 1 | 2.724 | 0.606 |
| LTM | WaveNet-512 | 512 | 0.502 | 0.849 |
| | Transformer-512 | 512 | 0.370 | 0.887 |
| | Transformer-32 | 32 | 0.852 | 0.718 |
| EVT | CCSTM-512 | 512 | 1.237 | 0.682 |
| | IDyOM | * | 1.870 | 0.426 |

### 6.2 Online Learning

To gain insights into the intra-opus learning, we inspect how the predictive performances of the models evolve over the course of individual pieces of music. More specifically, we visualize the precision and NLL (“information content”) of the models at different time steps within the pieces. To that end, we compute the mean of the respective metric at time step t over all compositions in the test set with length ≥ t. We found that the performances of all models oscillate, with odd time steps being easier to predict than even time steps. This phenomenon is caused by the piano roll representation, where longer notes that span several time steps tend to start at even time steps and repeat at odd time steps; such pitch repetitions are easier to predict. For improved display, each value in Figure 2a and Figure 2b is computed from the mean of successive even and odd time steps. We show the plots up to t = 1000, since only a few pieces are longer than that.
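The per-time-step averaging and the even/odd pairing described above can be sketched as follows; the function name and the list-of-arrays input format are our own assumptions (we also assume an even `t_max`, as in the paper's t = 1000).

```python
import numpy as np

def per_step_curve(metric_per_piece, t_max=1000):
    """Mean of a per-time-step metric (e.g., NLL) over all test pieces
    with length >= t, then averaged over successive even/odd step pairs
    for display, as in Figure 2. `metric_per_piece` is a list of 1-D
    arrays, one per piece; `t_max` must be even."""
    curve = np.array([
        np.mean([m[t] for m in metric_per_piece if len(m) > t])
        for t in range(t_max)
    ])
    # pair steps (0, 1), (2, 3), ... and average each pair
    return curve.reshape(-1, 2).mean(axis=1)
```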

Figure 2

Negative log-likelihood (NLL) and precision as functions of time-steps in intra-opus prediction, averaged over pieces of the test set.

Figure 2 shows significant performance drops for all models at t ∈ {192,256,512,576,768}. These time steps account for {12,16,32,48} measures of 4/4-time signature and {16,48,64} measures of 3/4-time signature, which are divisible by 4. In the used dataset, coherent segments often have a length of (multiples of) 4 measures (e.g., in Figure 3 the segments A and B consist of four measures). Such drops in precision are usually caused by novel segments being introduced in the compositions, which has also been exploited in previous works on melody segmentation (Pearce et al., 2010; Lattner et al., 2015a, b).

Figure 3

Prediction of DSTMs on A Scone For Breakfast (sessiontune157) from the test dataset. Green indicates the actual pitch, and red indicates a prediction error.

#### 6.2.1 Short-Term Models

The NLL is substantially lower for the CCSTM-512 model compared to the other STMs (see Figure 2a), which is most likely caused by the zero-frequency problem of the DCSTM-512 and MC-3, and the ability to perform a “soft-assignment” of keys and queries to code vector entries by the CCSTM, as discussed at the end of Section 6.1. It is interesting to see that this performance gap is not as substantial for precision. In general, the CCSTM-512 has higher precision than the PPM model at the beginning of pieces and after new segments are introduced. This indicates that the CCSTM-512 adapts faster to changes in the non-stationary latent data distribution than the PPM model. When the distribution is established, the PPM model catches up. This might yet again be due to the zero-frequency problem, which is less severe when more observations have been made. The fact that the CCSTM-512 outperforms the PPM model so drastically in the NLL metric is most likely again due to the soft-assignment of keys and queries. Even when the maximum predicted value of the CCSTM-512 model is incorrect, some probability mass is still assigned to the correct value, which is not trivial for the PPM model. This phenomenon affecting the CCSTM model could even be caused by some long-term knowledge leaking into the intra-opus learning (i.e., by learning that different keys within a piece yield the same value).

#### 6.2.2 Short-Term Models vs. Long-Term Models

When comparing the short-term model CCSTM-512 with the long-term model WaveNet-512, it can be seen that in the first time steps, the WaveNet-512 performs better than the CCSTM-512 in terms of NLL and precision. This is expected behavior that applies to all LTMs and STMs, as seen in Figures 2a and 2b. The WaveNet-512 fits the inter-opus latent distribution, which is useful for predictions before the intra-opus distribution has been established. After several time steps, the CCSTM-512 performs better than the WaveNet-512, as its internal memory (i.e., the fast-weight matrix Wt) already contains sufficient information about the intra-opus distribution. Note that choosing other prior distributions (like those discussed in Section 3.7) could mitigate the initial low performance and improve the overall performance of the DSTMs.

#### 6.2.3 Length of the Receptive Field

Comparing the results of the CCSTM-512 (a receptive field of 32 measures in 4/4 time signature) with those of the CCSTM-32 (2 measures in 4/4 time signature), the model with the longer receptive field naturally outperforms the model with the shorter one. However, when comparing the performance of the CCSTM-32 with the Transformer-32, we find that after t > 64, the CCSTM-32 outperforms the Transformer-32. The Transformer-32 does not improve its performance over time. This is because the Transformer-32 can only access information of the previous two measures, which is insufficient to cover segment repetitions. In contrast, the CCSTM-32 can gather long-term information in the frequency matrix Wt. Note that this is a critical advantage of the proposed short-term models over the Transformer: at test time, the receptive field can be smaller while still incorporating information from a more distant past, leading to more efficient computation.

### 6.3 Prediction Errors (Qualitatively)

In Figure 3 we depict the predictions of the CCSTM-512 and the DCSTM-512 on a single piece from the test set. Considering the overall form AABB, there are almost no prediction errors in the repetitions of sections A and B. The prediction errors of the CCSTM-512 (see Figure 3a) in the first occurrence of the respective sections are understandable, as the model had too few stimuli available to learn the corresponding distributions.

The DCSTM-512 (see Figure 3b) often erroneously predicts pitch = 47, in the first occurrences of segments A and B. This is caused by the zero-frequency problem, which occurs whenever a new distinct code is introduced. When the zero-frequency problem occurs, the prior uniform distribution is predicted. In such a case, we chose in our implementation to deterministically predict pitch = 47. In the first occurrences of A and B, the zero-frequency problem occurs when a note is repeated. This shows that the DCSTM-512 poorly maps similar short contexts to the same code (as a length-one context would be enough to predict a repeated note correctly). The fact that only a few prediction errors occur in the repetitions of segments A and B shows that the DCSTM-512 maps similar longer contexts to the same code more confidently.

### 6.4 Confusion Matrices

In Figure 4 we report the confusion matrices of the CCSTM-512 and the DCSTM-512, normalized by the number of occurrences of the respective pitches. The confusion matrices for both models showed some common tendencies. The confusion was higher for relatively close pitches (according to the ordering of the MIDI notes). This is desirable, as we generally expect composed melodies to consist of small intervals. For the special class pitch = 47, indicating a rest, we found more confusion, shown by a high frequency of both false positives (leftmost column) and false negatives (bottom row). A rest has a different spatial meaning than the other class labels; e.g., it cannot be predicted from the tonality of a piece. Therefore, it might be more challenging for the CNN-encoder to learn meaningful features for rests, which could lead to a relatively higher frequency of false negatives for pitch = 47. The relatively high frequency of false positives for pitch = 47 accounts for cases of the zero-frequency problem. This is more dominant for the DCSTM-512 as compared to the CCSTM-512, which quantifies the findings of Section 6.3.
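A row-normalized confusion matrix as described can be sketched as below; this is a generic illustration with a hypothetical function name, not the authors' evaluation code.

```python
import numpy as np

def normalized_confusion(targets, preds, num_classes):
    """Confusion matrix where entry (i, j) is the fraction of occurrences
    of true pitch i that the model predicted as pitch j (rows of unseen
    pitches stay zero)."""
    cm = np.zeros((num_classes, num_classes))
    np.add.at(cm, (targets, preds), 1.0)  # count (true, predicted) pairs
    counts = cm.sum(axis=1, keepdims=True)  # occurrences per true pitch
    return np.divide(cm, counts, out=np.zeros_like(cm), where=counts > 0)
```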

Figure 4

Confusion matrices for DSTMs.

The horizontal and vertical lines in the confusion matrices indicate false positives and false negatives for the pitch classes C and F. These pitch classes are the most predominant in the corpus, as shown in the histogram in Figure 8. Therefore, the models get biased towards predicting these pitches (accounting for the false positives). When they make an erroneous prediction, they most likely missed one of those pitches (accounting for the false negatives).

### 6.5 Input Sensitivity

We inspected the sensitivity of the DSTMs using saliency maps (Simonyan et al., 2014) aggregated over our test set. The saliency map visualizes the corresponding query in a piano roll for a given prediction at a specific time step. The intensity of the piano roll element (i, n − j) indicates how sensitive the prediction is to the presence of pitch i at time lag j in the query (for details on how the saliency maps are computed, please refer to Appendix C). To obtain an overview, we aggregated the saliency maps over the entire test set. For each test composition, we computed the gradients of 10 randomly sampled time steps and translated each map in pitch and time relative to the corresponding sampled output’s pitch and time. For instance, if the sampled output is pitch i at time step j, then element (Δi, n − Δj) of the translated map gives the partial derivative with respect to (s_{j−Δj})_{i+Δi}, i.e., with respect to the input that is Δi semitones away from i at time Δj prior to j. The translated maps were finally summed to obtain the aggregated maps (see Figure 5).
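The translate-and-sum aggregation can be sketched as follows. This is our own reading of the procedure (see Appendix C for the authors' details): time lags are taken to be relative already, so only the pitch axis is re-indexed by the interval Δi; the function name is hypothetical.

```python
import numpy as np

def aggregate_saliency(maps, pitches, num_pitches):
    """Sum saliency maps after translating each in pitch so that rows
    index the interval Δi = pitch_row - sampled_pitch.

    maps: list of (num_pitches, n) gradient-magnitude arrays, one per
    sampled prediction; pitches: sampled output pitch index per map."""
    n = maps[0].shape[1]
    agg = np.zeros((2 * num_pitches - 1, n))  # rows: Δi in [-(P-1), P-1]
    for sal, i in zip(maps, pitches):
        # row r of sal (pitch r) lands at index r - i + (num_pitches - 1)
        start = num_pitches - 1 - i
        agg[start:start + num_pitches] += sal
    return agg
```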

Figure 5

Aggregate saliency maps for DSTMs. Pixel intensities indicate how important variables are for prediction on average.

In the time dimension, we observe high intensities at time lags 48, 96, 192, and 384 for both models. These time lags correspond to 4, 8, 16, and 32 measures in 3/4-time signature. Furthermore, high intensities are found at time lags 64, 128, and 256, which correspond to 4, 8, and 16 measures in 4/4-time signature. In the used dataset, coherent segments with a length of 4 measures are prevalent (see Figure 3). The measure counts corresponding to the found important lags are all divisible by 4. This shows that the DSTMs pay attention to events at time lags which are important to the musical form. However, we also observe high intensities of events at small time lags for both models, with the CCSTM-512 having larger intensity magnitudes than the DCSTM-512. Therefore, in addition to their awareness of musical form, the models also learned what PPM assumes: recent events are particularly discriminative when matching key and query.

In the pitch interval dimension, the DSTMs are overall more sensitive to small intervals. At small time lags and at the musical form time lags, the DSTMs are sensitive to slightly larger intervals. The DCSTM-512 is particularly insensitive to the perfect unison (i.e., the pitch interval 0). This quantitatively reflects the findings of Section 6.3: the DCSTM-512 poorly predicts repeated notes. The fact that the DSTMs focus on all pitch intervals (except the unison) equally shows that all intervals are essential for matching keys and queries. However, apparently, the unison is less discriminative in this regard.

### 6.6 Code Assignment

To gain insights into the relationship between context encoding and prediction, we computed the similarity between codes predicting different pitches. More precisely, in Figure 6 we report similarity matrices, where the value at index (i, j) indicates the mean similarity of code pairs predicting pitches i and j. For example, in a fully diagonal similarity matrix, every code would only predict one particular pitch in the examined data set. As the code space dimensionality limits the number of patterns that can be stored in code vectors, codes could be re-used for different predictions (i.e., similar codes could predict different pitches). We visualize this by opposing the similarity matrices of inter-opus and intra-opus data.

Figure 6

Similarity of codes grouped by their pitch value.

To that end, for the inter-opus case, we sampled 5000 key/value pairs per distinct pitch in the test dataset. This was done for the DCSTM-512 and the CCSTM-512, and is reported in Figure 6a and Figure 6c, respectively. For the intra-opus case, we computed the similarities only for codes generated within the same piece (see Figure 6b and Figure 6d).
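The pitch-grouped similarity matrix can be sketched as below. This is our own illustration (assuming cosine similarity between code vectors and at least one code per pitch); the function name is hypothetical.

```python
import numpy as np

def code_similarity_by_pitch(codes, pitches, num_pitches):
    """Entry (i, j): mean pairwise cosine similarity between codes that
    predict pitch i and codes that predict pitch j.

    codes: (N, D) code vectors; pitches: (N,) predicted pitch per code."""
    codes = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    sims = codes @ codes.T  # all pairwise cosine similarities
    mat = np.zeros((num_pitches, num_pitches))
    for i in range(num_pitches):
        for j in range(num_pitches):
            # average over all code pairs from the two pitch groups
            mat[i, j] = sims[np.ix_(pitches == i, pitches == j)].mean()
    return mat
```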

The intra-opus code-similarity matrices have high-intensity diagonals, indicating that contexts followed by the same pitch are encoded into similar codes. In contrast, the inter-opus matrices show squares along the diagonal, indicating that codes are re-used for contexts that precede pitches a semitone apart. Since this rarely happens in the intra-opus case, the DSTMs learned to re-use codes only when the code re-use does not lead to intra-opus prediction confusion.

It is interesting that re-used codes mostly predict pitches a semitone apart. However, this is understandable considering the following hypothesis. The encoders can assign similar codes to contexts that are most similar but least likely to occur together within a composition. Such contexts could be typical motives in different scales. Considering a motive in different scales, different pitches would mostly differ by one semitone. In the used dataset, there are rarely any intra-opus scale changes. Therefore, such similar motives can be encoded into similar codes between pieces (and in different scales). The continuation (and prediction) may then differ mainly by one semitone.

### 6.7 Pitch Imbalance

In Figure 7 we report the pitch distribution (histogram) of our training set and find that it exhibits pitch imbalances. We investigated the precision for each pitch, which is reported in Figure 7 along with 95% confidence intervals. The models showed a decreasing trend for higher pitches (though the uncertainty was high). This decreasing trend coincides with higher pitches being less frequent in the training set. For the DSTMs, a similar trend was found for the infrequent low pitches. The effect of pitch frequencies on the precision was not limited to the low and high ranges; for instance, pitch = 71 was more challenging to predict than pitch = 72 for all models. The nonparametric models (the PPM model and the MC-3) did not perform significantly differently from the parametric models.
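The per-pitch precision with a 95% confidence interval can be sketched as below; the paper does not state which interval estimator was used, so a normal-approximation (Wald) interval is assumed here for illustration.

```python
def precision_ci(correct, total, z=1.96):
    """Per-pitch precision with a normal-approximation 95% confidence
    interval; `correct` and `total` count predictions for one pitch."""
    p = correct / total
    half = z * (p * (1 - p) / total) ** 0.5  # half-width of the interval
    return p, (p - half, p + half)
```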

Figure 7

Precision computed for each pitch in the test set and the histogram of pitches in the training data set.

### 6.8 Time Signature Imbalance

For different time signatures (e.g., 3/4 and 4/4), we expect that the DSTMs need to learn various context features. Furthermore, the performance might be influenced by how frequently a time signature occurs in the training set. Therefore, we investigated how the time signature of the pieces affects the performance.

We computed the ratio of correct predictions (grouped by the respective time signatures) and the train set distribution (histogram) of the time signatures (see Figure 8). It is interesting to see that the performance of all models is approximately constant for different time signatures, even though there exists a strong imbalance in the data set (with the discrete models PPM, DCSTM, and MC-3 having a slight performance drop in the 3/2 and 12/8 time signatures). Apparently, the continuous models are general enough to deal with such types of variances.

Figure 8

Precision computed for each time signature in the test set and the histogram of time signatures in the training data set.

## 7. Conclusion

Online learning models are faced with a trade-off between model order and data scarcity. Variable-order context models, such as PPM, address the challenge by defining implicit rules that show efficiency experimentally, but the rules are not always musically justified.

We instead proposed a method that learns similar context encodings for contexts that lead to similar predictions by an online-learning short-term model (STM). Thereby, we posed our task as a meta-learning problem of learning to online-learn intra-opus predictions. We introduced two models, the DCSTM and the CCSTM, both of which use WaveNet-style encoders and STMs. We initially defined the DCSTM as a 1st-order MC in an encoded space and showed that this model (as well as ordinary fixed-order MCs) is equivalent to a linear-kernel causal self-attention mechanism. We then derived the Differentiable Short-Term Model (DSTM) by replacing the one-hot encoder with a continuous encoder. We tested the models on “The Session” dataset, a corpus of monophonic, non-expressive Irish folk tunes whose compositions possess a short, repetitive structure. In this evaluation, the CCSTM-512 outperformed all other STMs in terms of precision and, even more substantially, in terms of NLL. It was also shown that the CCSTM-512 adapts faster to changes in the data distribution than PPM. Likely reasons for the better performance of the CCSTM over PPM are, firstly, that the CCSTM does not rely on exact matches between n-grams and, secondly, that it learns to attend to musically important time intervals.

The CCSTM-512 performed worse than the comparable long-term model (LTM) WaveNet-512 on the first time steps, but better after adapting to the intra-opus distribution. When the receptive field was sufficiently large, the CCSTM had lower predictive performance than the Transformer (which is an LTM). However, when the receptive field was small, the CCSTM outperformed the Transformer due to its external memory (i.e., the frequency matrix). The DCSTM-512 performed considerably worse than the continuous variant, slightly better than PPM in NLL, and slightly worse than PPM in precision. We found that the worse performance of the DCSTM-512 compared to the CCSTM-512 was partly due to the zero-frequency problem, occurring when the DCSTM-512 generates new distinct, discrete codes.

In summary, the proposed DSTM is a step towards explicit, meta-learning-based differentiable short-term memories. As the encoder is not limited to univariate time series, we believe this concept is promising for online learning of more complex time series, like musical audio or self-similar image data (e.g., textures). In principle, the model could be fitted to any non-stationary distribution. A further generalization of the model involves generating the target space st by encoding (n+1)-grams instead of using the data space directly as prediction targets. That way, the model could reduce the “inverse” zero-frequency problem, characterized by the model’s inability to predict any target value that was not observed before. However, that would introduce more long-term knowledge into the online-learning STM, which would also entail less flexible adaptation to distributions different from the training data distribution. A general objective for online-learning models should therefore be to maximize the predictive performance both on a data set and on out-of-distribution data. Further improvements to the models involve the use of multi-head context encodings and more sophisticated update rules for the frequency matrix, including (controlled) forgetting.

In future work, we want to investigate how to best combine long- and short-term architectures to create better models of cognition. Section 6.2 argued that models with very different architectures can identify segment boundaries using information content. We intend to pose testable hypotheses in music using Shannon’s Information Theory (Shannon, 1948). Possible hypotheses are that musical events are generally “typical”, or that the information rate develops during the course of a music piece in a predictable manner. Finally, we intend to test if our results transfer to audio using discrete audio encodings.

Appendices

Appendix A to C. DOI: https://doi.org/10.5334/tismir.123.s1

## Notes

1A note on terminology: in this paper, intra-opus will refer to stylistic/structural regularities within a specific piece – such as, e.g., the specific motivic repetition and development structure or its higher-level form. In contrast, we will use the terms inter-opus to emphasize that some patterns or regularities were learned from, or seem to hold across, several pieces within a “background corpus” of other pieces (presumably of the same style).

2Note on terminology: the term context will refer to a sequence of musical events, or a generalized representation thereof, that precede the current point in the piece and are used for conditioning the next prediction.

3The terms query, key, and value are used in the terminology of attention models as an analogy to concepts of addressable dictionaries (Vaswani et al., 2017). Values are stored in the dictionary with a corresponding key address. The values can be retrieved using a query that matches the corresponding key.

## Acknowledgements

The work leading to these results was conducted in a collaboration between JKU and Sony Computer Science Laboratories Paris under a research agreement. Additional support was provided by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, grant agreements 670035 (“Con Espressione”) and 101019375 (“Whither Music?”).

## Competing Interests

The authors have no competing interests to declare.

## References

1. Barr, D. R., and Thomas, M. U. (1977). An eigenvector condition for Markov chain lumpability. Operations Research, 25(6):1028–1031. DOI: https://doi.org/10.1287/opre.25.6.1028

2. Brooks, F. P., Hopkins, A., Neumann, P. G., and Wright, W. V. (1957). An experiment in musical composition. IRE Transactions on Electronic Computers, EC-6(3):175–182. DOI: https://doi.org/10.1109/TEC.1957.5222016

3. Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. (2021). Rethinking attention with performers. In 9th International Conference on Learning Representations (ICLR).

4. Cleary, J., and Witten, I. (1984). Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396–402. DOI: https://doi.org/10.1109/TCOM.1984.1096090

5. Conklin, D., and Witten, I. H. (1995). Multiple viewpoint systems for music prediction. Journal of New Music Research, 24(1):51–73. DOI: https://doi.org/10.1080/09298219508570672

6. Cover, T. M., and Thomas, J. A. (2005). Entropy, Relative Entropy, and Mutual Information, chapter 2, pages 13–55. John Wiley & Sons, Ltd. DOI: https://doi.org/10.1002/047174882X.ch2

7. Eck, D., and Schmidhuber, J. (2002). Finding temporal structure in music: Blues improvisation with LSTM recurrent networks. In Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pages 747–756. IEEE. DOI: https://doi.org/10.1109/NNSP.2002.1030094

8. Egermann, H., Pearce, M. T., Wiggins, G. A., and McAdams, S. (2013). Probabilistic models of expectation violation predict psychophysiological emotional responses to live concert music. Cognitive, Affective, & Behavioral Neuroscience, 13(3):533–553. DOI: https://doi.org/10.3758/s13415-013-0161-y

9. Graves, A., Wayne, G., and Danihelka, I. (2014). Neural Turing machines. CoRR, abs/1410.5401.

10. Gumbel, E. J. (1935). Les valeurs extremes des distributions statistiques. Annales de l’institut Henri Poincare, 5(2):115–158.

11. Huang, C. A., Vaswani, A., Uszkoreit, J., Shazeer, N., Hawthorne, C., Dai, A. M., Hoffman, M. D., and Eck, D. (2018). An improved relative self-attention mechanism for transformer with application to music generation. CoRR, abs/1809.04281.

12. Jang, E., Gu, S., and Poole, B. (2017). Categorical reparameterization with Gumbel-softmax. In 5th International Conference on Learning Representations (ICLR). OpenReview.net.

13. Kannan, D., Sharpe, D. J., Swinburne, T. D., and Wales, D. J. (2020). Optimal dimensionality reduction of Markov chains using graph transformation. The Journal of Chemical Physics, 153(24):244108. DOI: https://doi.org/10.1063/5.0025174

14. Katehakis, M. N., and Smit, L. C. (2012). A successive lumping procedure for a class of Markov chains. Probability in the Engineering and Informational Sciences, 26(4):483–508. DOI: https://doi.org/10.1017/S0269964812000150

15. Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR.

16. Kingma, D. P., and Ba, J. (2015). Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y., editors, 3rd International Conference on Learning Representations, (ICLR).

17. Klambauer, G., Unterthiner, T., Mayr, A., and Hochreiter, S. (2017). Self-normalizing neural networks. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R., editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, pages 971–980.

18. Lattner, S., Chacon, C. E. C., and Grachten, M. (2015a). Pseudo-supervised training improves unsupervised melody segmentation. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), pages 2459–2465.

19. Lattner, S., Grachten, M., Agres, K., and Chacon, C. E. C. (2015b). Probabilistic segmentation of musical sequences using restricted Boltzmann machines. In Proceedings of the 5th International Conference on Mathematics and Computation in Music (MCM), pages 323–334. DOI: https://doi.org/10.1007/978-3-319-20603-5_33

20. LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551. DOI: https://doi.org/10.1162/neco.1989.1.4.541

21. Liutkus, A., Cifka, O., Wu, S.-L., Simsekli, U., Yang, Y.-H., and Richard, G. (2021). Relative positional encoding for transformers with linear complexity. In International Conference on Machine Learning, pages 7067–7079. PMLR.

22. Maddison, C. J., Mnih, A., and Teh, Y. W. (2017). The concrete distribution: A continuous relaxation of discrete random variables. In 5th International Conference on Learning Representations (ICLR). OpenReview.net.

23. Maddison, C. J., Tarlow, D., and Minka, T. (2014). A* sampling. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems, pages 3086–3094.

24. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alche-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

25. Pearce, M. (2005). The Construction and Evaluation of Statistical Models of Melodic Structure in Music Perception and Composition. PhD thesis, Department of Computing, City University, London, UK.

26. Pearce, M., and Müllensiefen, D. (2017). Compression-based modelling of musical similarity perception. Journal of New Music Research, 46(2):135–155. DOI: https://doi.org/10.1080/09298215.2017.1305419

27. Pearce, M. T., Müllensiefen, D., and Wiggins, G. A. (2010). Melodic grouping in music information retrieval: New methods and applications. In Advances in Music Information Retrieval, pages 364–388. Springer. DOI: https://doi.org/10.1007/978-3-642-11674-2_16

28. Pinkerton, R. C. (1956). Information theory and melody. Scientific American, 194(2):77–87. DOI: https://doi.org/10.1038/scientificamerican0256-77

29. Quastler, H. (1955). Discussion, following mathematical theory of word formation, by W. Fucks. In Information Theory: Third London Symposium, volume 168. New York: Academic Press.

30. Roberts, M. G. (1982). Local Order Estimating Markovian Analysis for Noiseless Source Coding and Authorship Identification. PhD thesis, Stanford University.

31. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536. DOI: https://doi.org/10.1038/323533a0

32. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. (2016). Meta-learning with memory-augmented neural networks. In Balcan, M. F. and Weinberger, K. Q., editors, Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1842–1850, New York, New York, USA. PMLR.

33. Schlag, I., Irie, K., and Schmidhuber, J. (2021). Linear transformers are secretly fast weight programmers. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139 of Proceedings of Machine Learning Research, pages 9355–9366. PMLR.

34. Schmidhuber, J. (1992). Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139. DOI: https://doi.org/10.1162/neco.1992.4.1.131

35. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423. DOI: https://doi.org/10.1002/j.1538-7305.1948.tb01338.x

36. Shaw, P., Uszkoreit, J., and Vaswani, A. (2018). Self-attention with relative position representations. In Walker, M. A., Ji, H., and Stent, A., editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), volume 2, pages 464–468. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N18-2074

37. Simonyan, K., Vedaldi, A., and Zisserman, A. (2014). Deep inside convolutional networks: Visualising image classification models and saliency maps. In Bengio, Y. and LeCun, Y., editors, 2nd International Conference on Learning Representations (ICLR), Workshop Track Proceedings.

38. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

39. Sturm, B. L., Santos, J. F., Ben-Tal, O., and Korshunova, I. (2016). Music transcription modelling and composition using deep learning. In Proceedings of the Conference on Computer Simulation of Musical Creativity.

40. van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. In 9th ISCA Speech Synthesis Workshop, page 125. ISCA.

41. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.

42. Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016). Matching networks for one shot learning. In Lee, D. D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems, pages 3630–3638.

43. Witten, I., and Bell, T. (1991). The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37:1085–1094. DOI: https://doi.org/10.1109/18.87000

44. Yang, L., Chou, S., and Yang, Y. (2017). MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. In Cunningham, S. J., Duan, Z., Hu, X., and Turnbull, D., editors, Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 324–331.