
# Steerable Music Generation which Satisfies Long-Range Dependency Constraints

## Abstract

Although music is full of repetitive motifs and themes, artificially intelligent temporal sequence models have yet to demonstrate the ability to model or generate musical compositions that satisfy steerable, long-range constraints needed to evoke such repetitions. Markovian approaches inherently assume a strictly limited range of memory while neural approaches—despite recent advances in evoking long-range dependencies—remain largely unsteerable. More recent models have been published that attempt to evoke repetitive motifs by imposing unary constraints at intervals or by collating copies of musical segments. Although the results of these methods satisfy long-range dependencies, they come with significant—potentially prohibitive—sacrifices in the musical coherence of the generated composition or in the breadth of satisfying compositions which the model can create. We present regular non-homogeneous Markov models as a solution to the long-range dependency problem which uses relational automata to enforce binary constraints to compose music with repeating motifs. The solution we present preserves musical coherence (i.e., Markovian constraints) for the duration of the generated compositions and significantly increases the range of satisfying compositions that can be generated.

How to Cite: Bodily, P. and Ventura, D., 2022. Steerable Music Generation which Satisfies Long-Range Dependency Constraints. Transactions of the International Society for Music Information Retrieval, 5(1), pp.71–86. DOI: http://doi.org/10.5334/tismir.97
Published on 25 Mar 2022
Accepted on 11 Feb 2022. Submitted on 28 Feb 2021.

## 1 Introduction

Music is full of long-range dependencies. Examples include themes that recur at regular or irregular intervals; motifs across pitch, rhythm, and harmonic progression; and lengths of segments and phrases. The repetitive nature of music contributes to the fluency with which a song is processed by the human mind and has been linked to the success with which music earns popularity and renown (Nunes et al., 2015).

Despite the significance of such long-range dependencies, nearly all computational models for algorithmic music composition exhibit significant limitations in their ability to model such dependencies. Resolving this discrepancy between the long-range dependencies in human-composed music and the lack thereof in computationally-composed music has been considered one of the foremost challenges for the next decade of music information research (Widmer, 2016). In short, the question of how to design a computational model for composing music demonstrating long-range dependencies remains an open and challenging problem for AI researchers.

Although several models of composition have been proposed, they generally fall into one of a few common types. Neural approaches constitute one popular approach to sequence generation. Long short-term memory (LSTM) models reliably produce unique and musically-coherent sequences (Oore et al., 2020) but are known to have issues of decaying memory that make predicting musical events and patterns over longer timespans problematic (Hadjeres and Nielsen, 2020). The more recent advent of transformers has made significant breakthroughs generating music with clear long-range dependencies (Huang et al., 2018). The major challenge with neural approaches, however, has been their lack of steerability and explainability, attributes which for many applications of computer-generated music (including those with which we concern ourselves) remain essential and of primary interest (Dathathri et al. 2019; Louie et al. 2020).

Markovian models represent a second common type of music composition model. Markov models are a class of probabilistic sequence models that fundamentally assume the composition of a sequence depends strictly on the immediate context, without regard for what precedes that context (Schulze and Van Der Merwe, 2011). Recent work has focused on imposing steerable, long-range dependencies in Markov models through the incorporation of constraints. Non-homogeneous Markov models (also called constrained Markov models) have demonstrated the ability to enforce unary constraints during the generative process, allowing some degree of control in generating sequences (Pachet et al., 2011). Barbieri et al. (2012) demonstrate that by pre-defining sets of unary constraints to have matching constraint criteria, sequences can be generated that evoke a sense of long-range dependency. This method works well in one sense, but the necessity of pre-defining the matching constraint criteria severely limits the results that can be obtained from the model.

Others have focused on the specific case of imposing regular constraints in sequence generation. Factor graph models impose regular constraints by combining an automaton representing regular constraints with a Markov model (Papadopoulos et al., 2015). Multi-valued Decision Diagrams (MDDs) can be used to model a combination of Markov and regular constraints (Perez and Régin, 2017). To the extent that patterns of musical repetition can be represented as regular constraints, these models represent effective solutions.

Despite numerous efforts, methods have thus far not been found to impose constraints to control generation in deep learning and neural architectures (Briot and Pachet, 2020; Hadjeres and Nielsen, 2020).

Besides constraint-driven approaches, another common approach to creating long-range repetitive and phrasal structure in music is using a form of “copy-paste”, generating musical subsequences that are then duplicated at intervals (Collins and Laney, 2017). This approach is fundamentally non-Markovian and commonly leads to unnatural transitions at splice sites. Pachet et al. (2017) present a similar “copy-paste” method for creating structured music lead sheets but use a controlled variation mechanism to ensure that copies are contextually situated to satisfy Markovian properties.

We present a novel solution to address the long-range dependency problem. To our knowledge, this is the first method which has been demonstrated to generate compositions according to Markov probabilities—ensuring a local cohesion throughout the sequence—with steerable, long-range, dynamic dependencies, allowing the imposition of wide-spread motifs and repetitions. We demonstrate how the model is capable of generating music that simultaneously exhibits a variety of patterns and motifs in pitch, rhythm, harmony, and lyrics.

Our approach involves the combination of nonunary constraints with a Markov model to allow the enforcement of long-range dependencies. Our presentation will be divided into two sections. In the first section, we demonstrate how nonunary constraints are effectuated within the context of the Markov window. Dependencies in such a limited context can hardly be considered long-range, but resolving how nonunary constraints are handled within the Markov window is a necessary complement to considering their implementation beyond the Markov window and provides opportunity to demonstrate on a smaller scale some powerful applications of nonunary constraints. In the second section, we demonstrate the implementation of nonunary constraints spanning arbitrary ranges beyond the Markov window.1

## 2 Generating music with steerable short-range dependencies

Although our main purpose is ultimately to define a model capable of enforcing long-range dependencies, any consideration of constraints in the context of Markov models will require an understanding of how constraints behave in the short-range, namely within the range defined by the Markov window. As such, we treat the issue of short-range dependencies first. It is worth noting that the method for implementing such short-range dependencies—aside from being more computationally efficient than the method for implementing dependencies beyond the Markov window—is useful on its own as a model for enforcing nonunary constraints independent of whether long-range dependencies are also added.

The implementation and application of relational constraints (which we define below) have not yet been carefully examined in the context of Markov models. In implementing such constraints, we happened upon some unique applications of these constraints that we had not initially anticipated and that are worthy of mention. These are also discussed in this section.

### 2.1 Constraints, Markov models defined

In speaking of long- and short-range dependencies in a sequence of music, we can formally define a dependency between two (or more) points in terms of what in set theory are called relations. An n-ary relation ρ defined over sets Σ1, …, Σn is a subset of the Cartesian product Σ1 × ⋯ × Σn, or in other words a set of tuples (x1, …, xn) such that xi ∈ Σi for i ∈ [1, n]. We will deal primarily with n-ary relations where n = 1 and n = 2, called unary and binary relations respectively. In general Σ1, …, Σn can be different sets, but where we apply these relations to sequences of elements deriving from the same set, we assume relations over a single set Σ. Examples of unary relations might include the set {x | x is a note in a C major chord} or {x | x is a word rhyming with ‘today’}. Examples of binary relations might include the set {(x1,x2) | x1 and x2 are notes of the same pitch} or {(x1,x2) | x1 is a word that rhymes with x2}.

Given a sequence of random variables X = (X1, …, Xn) we define a unary relational constraint or simply unary constraint (Xi, ρ) as a constraint that is satisfied if and only if Xi = xi and (xi) ∈ ρ. We similarly define a binary relational constraint or simply binary constraint (Xi, Xj, ρ) as a constraint that is satisfied if and only if Xi = xi, Xj = xj, and (xi, xj) ∈ ρ. We assume an analogous definition for an n-ary relational constraint. A set of constraints is satisfied if and only if all constraints in the set are satisfied. An example of a unary constraint is (X3, {x | x is a verb}).

A Markov model M = (Σ, PI, PT) consists of an alphabet of tokens Σ, a set of initial probabilities PI: Σ → [0.0,1.0] such that Σx∈Σ PI(x) = 1.0, and a set of transition probabilities PT: Σ × Σ → [0.0,1.0] such that ∀ x ∈ Σ, Σy∈Σ PT(x, y) = 1.0. The probability according to M of a sequence s = (s0, …, sn) is

(1)
$P_M(s) = P_I(s_0) \prod_{i=1}^{n} P_T(s_{i-1}, s_i).$
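As a concrete illustration, Equation 1 amounts to a product of probability lookups. The following is our own Python sketch (the authors' system is implemented in Java), with toy probabilities estimated from the two training phrases used later in Figure 1:

```python
def sequence_probability(s, P_I, P_T):
    """Probability of s = (s_0, ..., s_n) under M = (Sigma, P_I, P_T), per Eq. 1."""
    p = P_I.get(s[0], 0.0)
    for prev, cur in zip(s, s[1:]):
        p *= P_T.get((prev, cur), 0.0)  # P_T(s_{i-1}, s_i)
    return p

# Toy model estimated from "once I saw a bear with hair" / "once I saw a cat with hair"
P_I = {"once": 1.0}
P_T = {("once", "I"): 1.0, ("I", "saw"): 1.0, ("saw", "a"): 1.0,
       ("a", "bear"): 0.5, ("a", "cat"): 0.5,
       ("bear", "with"): 1.0, ("cat", "with"): 1.0, ("with", "hair"): 1.0}

print(sequence_probability(["once", "I", "saw", "a", "bear", "with", "hair"], P_I, P_T))  # 0.5
```

Any sequence containing a transition unseen in training receives probability zero.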

As defined by Pachet et al. (2011), a finite-length non-homogeneous Markov model (NHMM) M̃ is created given a finite length value l, a trained Markov model M = (Σ, PI, PT), and a set C of unary constraints. The value of l determines the length of the sequence X = (X1, …, Xl) that M̃ models. We can partition C into subsets C1, …, Cl such that subset Ci represents constraints applying to Xi. M̃ has the following properties:

1. M̃ only assigns non-zero probability to sequences for which C is satisfied;
2. Sequences that satisfy C are assigned the same probability in M and M̃ up to a constant factor.

M̃ is constructed by defining transition matrices M1, …, Ml–1 where each matrix Mi is initialized as a copy of the transition matrix PT of M. Each Mi is then modified according to the constraints in Ci. The constraint satisfaction problem is made arc-consistent,2 and probabilities are normalized to ensure solutions are assigned the same relative proportion of the probability density (see Pachet et al. (2011) for details on enforcing arc-consistency and normalization).
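The construction can be sketched as follows. This is our own vectorized Python approximation of the procedure in Pachet et al. (2011), not the authors' code: unary constraints are applied as 0/1 masks, and a backward pass both prunes states with no satisfying continuation (the effect of arc-consistency) and supplies the normalization that preserves relative probabilities:

```python
import numpy as np

def build_nhmm(P_I, P_T, l, constraints):
    """Sketch of NHMM construction (after Pachet et al., 2011).
    P_I: (k,) initial probabilities; P_T: (k, k) transition matrix;
    l: sequence length; constraints: {position (1-based): 0/1 mask over states}.
    Returns a constrained initial vector and l-1 transition matrices under
    which satisfying sequences keep their relative Markov probabilities."""
    k = len(P_I)
    ones = np.ones(k)
    p0 = P_I * constraints.get(1, ones)
    # Matrix M_i (i = 1..l-1) governs the transition into position i+1,
    # so the unary constraint at position i+1 zeroes columns of M_i.
    mats = [P_T * constraints.get(i + 2, ones)[None, :] for i in range(l - 1)]

    # Backward pass: beta_i(x) = total probability mass of satisfying
    # suffixes from position i in state x; states with beta == 0 are
    # effectively pruned, which is what arc-consistency accomplishes.
    beta = ones.copy()
    betas = [beta]
    for M in reversed(mats):
        beta = M @ beta
        betas.append(beta)
    betas.reverse()

    Z = p0 @ betas[0]  # total probability mass of satisfying sequences
    p0_t = p0 * betas[0] / Z
    mats_t = []
    for i, M in enumerate(mats):
        with np.errstate(divide="ignore", invalid="ignore"):
            Mt = M * betas[i + 1][None, :] / betas[i][:, None]
        mats_t.append(np.nan_to_num(Mt))  # rows of pruned states become 0
    return p0_t, mats_t
```

The product of the returned factors telescopes to PM(s)/Z for every satisfying sequence s, which is exactly property 2 above.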

The properties, construction, and normalization of M̃ are the same regardless of the Markov order of M. M has traditionally been presented as a single-order Markov model M1 which generates a sequence according to the probability function P(X1, …, Xn) = P(X1) · P(X2|X1) · … · P(Xn|Xn–1). However, M may just as easily be a d-order Markov model, Md, which generates a sequence with a longer memory such that P(Xi|X1, …, Xi–1) = P(Xi|Xi–d, …, Xi–1). The fact that the order of M does not affect the NHMM construction falls out immediately from the observation that Md can be created using a single-order Markov model on an alphabet of d-grams (Papadopoulos et al., 2015). More precisely this means changing the finite state space alphabet Σ of M1 (where each element ai ∈ Σ is a single element) to a state space Σd = {A1, …, Am} where each element Aj ∈ Σd represents a d-tuple of elements from Σ.

In Md a transition from tuple Aj to tuple Ak exists if and only if the (d–1)-length suffix of Aj and the (d–1)-length prefix of Ak are identical. Thus when probabilistically sampling a sequence from Md or M̃d, all unary tokens in the first sampled sequence-state and the final token of each subsequently sampled sequence-state are concatenated to form the generated token sequence (e.g., see Figures 1 and 2). Thus a d-order NHMM of length l generates sequences of length (d + l – 1).

Figure 1

Transition probabilities for a 3rd-order Markov model over words. The model has been trained on the phrases “once I saw a bear with hair” and “once I saw a cat with hair”. Each state in this model is a tuple of 3 elements and transitions are between tuples that overlap by all but one element. Note that though a path through this model will have length 5, the generated token sequence will have length 7 (i.e., element sequence length + order – 1).

Figure 2

Transition probabilities for a 3rd-order NHMM of length 4. This model is built from the Markov model in Figure 1. The diagram shows the result of first modifying transition matrices M1,M2, and M3 according to two unary constraints: C3 = (X3, {x | x is a tuple of words containing at least one preposition}) and C4 = (X4, {x | x is a tuple of words in which the first and last words rhyme}). States marked with a white ‘X’ are pruned due to the length constraint (i.e., transitions through these states do not result in element sequences of length 4). States marked with a gray ‘X’ are pruned due to the addition of the C3 constraint. This constraint is an example of a floating constraint in that the part of speech (POS) constraint is effectively satisfied by any satisfying token appearing at sequence positions 3, 4, or 5. States marked with a black ‘X’ are pruned due to the further addition of the C4 rhyme constraint. The C4 constraint is an example of a dynamic constraint in that in the 6-word sequence generated from the model, the rhyme group relating the 4th and 6th words is dynamically chosen at run time. Grey transitions represent transition probabilities that are zeroed as a result of applying or ensuring arc-consistency of constraints.

### 2.2 Binary constraints in an NHMM

The NHMM model was originally defined to enforce unary constraints. With a d-order NHMM model, each state in the model represents a d-tuple. In this formulation, short-range dependencies can already be enforced if the range of the dependency is less than or equal to d. For example, when generating a 6-length sequence of words with a 3rd-order NHMM, we can enforce a rhyming constraint between the 4th and 6th words (a range of 3) by zeroing out transitions in M3 to 3-tuples in which the first and last words do not rhyme (see Figure 2). Thus the Markov order defines a range within which short-term dependencies can be modeled using unary constraints in the traditional NHMM model. Because the NHMM model can be constructed in polynomial time, this represents an efficient method for modeling short-range dependencies.

In a traditional d-order NHMM, constraints are unary and thus can only be applied to prune d-tuples at one state position, Xi. However, with only a slight adjustment, a d-order NHMM can be constructed to allow binary constraints applied to d-tuples at two adjacent positions, Xi–1 and Xi. To implement this adjustment, consider that enforcing a unary constraint (Xi, ρ) in an NHMM involves pruning the transition matrix Mi–1. However, Mi–1 represents transitions from Xi–1 to Xi, and thus we can choose to zero out transition probabilities based on constraint criteria that depend on both Xi–1 and Xi. If a particular (xi–1, xi) pair does not satisfy a binary constraint (Xi–1, Xi, ρ), we zero the transition from xi–1 to xi in Mi–1. Thus without increasing the polynomial runtime required for NHMM construction, we can extend the range of short-range dependencies from the Markov order (d) to d + 1. Considering that Markov orders are commonly less than 5, this effectively increases the range of enforceable short-range dependencies in NHMMs by 20% or more.
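The pruning step can be sketched as a filter over a transition matrix. The helper below and its toy rhyme classes are our own illustrations (a real system would consult a pronunciation dictionary); note that adjacent d-tuples jointly cover d + 1 tokens, so relating the first token of the earlier tuple to the last token of the later tuple exercises the full extended range:

```python
def apply_binary_constraint(M, relation):
    """Zero every transition (a -> b) of M_{i-1} whose state pair fails the
    binary relational constraint; M: {state: {state: prob}}.
    (In a full NHMM, arc-consistency and renormalization would follow.)"""
    return {a: {b: p for b, p in row.items() if relation(a, b)}
            for a, row in M.items()}

# Toy rhyme classes standing in for a pronunciation dictionary
RHYME_CLASS = {"cat": "AT", "hat": "AT", "bear": "AIR", "hair": "AIR"}

def rhymes(u, v):
    cu, cv = RHYME_CLASS.get(u), RHYME_CLASS.get(v)
    return cu is not None and cu == cv

# d = 3: the pair of adjacent 3-tuples spans d + 1 = 4 tokens, so this rhyme
# between token 1 of x and token 4 of y could not be expressed as a unary
# constraint on any single 3-tuple.
M = {("cat", "in", "the"): {("in", "the", "hat"): 0.5, ("in", "the", "box"): 0.5}}
pruned = apply_binary_constraint(M, lambda x, y: rhymes(x[0], y[-1]))
print(pruned)  # only the transition to ("in", "the", "hat") survives
```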

See endnote 1 for a link to our source code for building d-order NHMMs with binary constraints.

### 2.3 Applications of dependencies in sequence generation

There are many different types of dependencies that can be represented as relational constraints in higher-order NHMMs. We will look at two examples here. These same types of dependencies are subsequently demonstrated as long-range dependencies, but the narrower range of short-range dependencies provides a better context for an introduction. To facilitate ease of understanding we will use natural language to demonstrate the examples, but they are just as powerfully applied to sequences of pitches, rhythms, chords, or other musical viewpoints. When referencing parts of speech (POS), we use the POS tags from the Penn Treebank Project.3 For all POS labeling we use the Stanford CoreNLP Toolkit (Manning et al., 2014).

#### 2.3.1 Dynamic dependencies

As mentioned in the introduction, one way of evoking the impression that two positions of a composition are related (e.g., in a repeated motif) is to impose identical or related unary constraints at each position. For example, Barbieri et al. (2012) use an NHMM to generate lyrics such that last word variables in the first and second lines are respectively constrained to be the word “today” and to rhyme with “today”. Note that once the constraints have been defined, the variables representing the words at these two positions are completely independent. We call this type of constraint a static constraint because all of the criteria needed to make the assignment to one variable independent of the other is statically defined before the model is built.

Using static constraints to create (the illusion of) dependencies comes with huge limitations. As an example, consider that the rhyme group to which “today” belongs is only one of thousands of possible rhyme groups, meaning the range of viable solutions that are accessible to the model is reduced by several orders of magnitude. In the case of music where we might constrain pitches, rhythms, or chords to match in order to create a motif, static constraints have a similarly reducing effect on the solution space, with the exact reduction being equal to the size of the pitch, rhythm, and chord domains. This means an 88× reduction to model a pitch dependency with static constraints (if we limit ourselves to the range of pitches on a standard keyboard); a 15× reduction to model a rhythm dependency (if our shortest duration is a 16th note, our longest a whole note, and allowing double-dotted durations); and a 4,000× or more reduction to model chords.4 This problem grows exponentially with each additional constraint set needed to construct the motif or other motifs.

Relational constraints in higher-order NHMMs enable a class of dependency we call dynamic dependencies. A dynamic dependency is a true dependency in that the assignment to one variable cannot be made independently of another. The precise criteria by which a dynamic dependency is satisfied at a particular sequence position are thus determined at runtime as a function of assignments at other positions. Using a dynamic constraint, the system determines dynamically whether a particular pair of words belongs to the same rhyme group without needing to statically define the rhyme group of either word ahead of time. Dynamic dependencies recover all of the expressive power lost by the use of static constraints to create dependencies.

A dynamic dependency is achieved in d-order NHMMs by testing whether or not the d + 1 length sequence deriving from assignments to two adjacent variables Xi and Xi+1 represents a subsequence in which two elements of interest satisfy the desired property (e.g., rhyme). Without additional modifications to the NHMM, this requires that the dynamically constrained positions must be within d + 1 positions of each other, thereby maintaining the Markov property (see Figure 2). This makes dynamic constraints particularly well-suited for genres like rap where rhymes are often closely situated.

In many applications the dynamic constraints of interest relate elements at very distant positions. Such dynamic constraints are not well-suited for the approach we have outlined thus far: as the Markov order d increases, fewer satisfying solutions can be generated and solutions tend towards replicating exact phrases from the training corpus. Papadopoulos et al. (2014) demonstrate a regular-constraint solution to the problem of sampling exact phrases above a certain length from training data. Regular constraints are also a viable alternative for generating music with long-range dependencies, which we demonstrate in Section 3.

The examples in Section 2.4 use values of d in the range 4 ≤ d ≤ 8. In these cases we have elected to retain exact phrases both for the sake of demonstration and because there remains some possibly important aesthetic value in repurposing existing artifacts in novel problem domains (e.g., an objet trouvé). In fields such as computational creativity these solutions represent a significant contribution in their own right (Colton et al., 2010).

#### 2.3.2 Floating dependencies

Often when composing a sequence with dependencies, it is desirable to allow some flexibility in precisely where the dependency is satisfied. We call this type of dependency a floating dependency. For example, rather than constrain the ith position of a phrase to be a particular POS (e.g., a noun), we can constrain the model such that at least one token in the range of positions [i – d, i] must be a particular POS. Using a d-order Markov model, this dependency can be enforced using the binary constraint (Xi–d, Xi–d+1, ρ) where ρ = {(x, y) | the dependency is satisfied by at least one element in the d-tuples x or y}. Floating dependencies are useful for granting the model greater flexibility in generation.
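A floating dependency reduces to an "at least one token in the window" test over the (d + 1)-length window covered by two adjacent d-tuples. The following is a minimal sketch (names and the toy preposition set are ours):

```python
def floating_constraint(predicate):
    """Binary relation over adjacent d-tuples (x, y) that is satisfied when at
    least one token in the combined (d+1)-length window satisfies `predicate`."""
    def relation(x, y):
        window = list(x) + [y[-1]]  # x plus the single new token y contributes
        return any(predicate(tok) for tok in window)
    return relation

# Toy example: some token anywhere in the window must be a preposition
PREPOSITIONS = {"with", "in", "of", "on"}
rel = floating_constraint(lambda w: w in PREPOSITIONS)
print(rel(("a", "bear", "with"), ("bear", "with", "hair")))  # True
print(rel(("I", "saw", "a"), ("saw", "a", "bear")))          # False
```

The constraint does not pin the satisfying token to any particular position, which is exactly the flexibility described above.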

An example of the application of floating dependencies is in defining semantic constraints. Barbieri et al. (2012) use an NHMM to generate a poem related to the concept of “music”. To do so they define unary constraints at random, fixed positions to impose the corresponding word be semantically related to “music” in addition to imposing other rhythm and POS constraints. By their own admission, the approach has limited flexibility because of having to define unary constraints at specific positions. Considering that theirs was a 2nd-order NHMM, we might reasonably expect that reformulation of their constraint as a short-range floating dependency, allowing the semantically-relevant word to be generated at one of 3 positions, would triple the number of “music” poems that the model could compose.

Floating dependencies can also be used to avoid over-restricting syllable-level NHMMs when imposing word-level constraints. Using a Markov process over words, Barbieri et al. (2012) are able to constrain according to word-level POS templates. For example, the word-level POS template for the phrase “Yesterday, love was such an easy game to play” is denoted as [NN, NN, VBD, JJ, DT, JJ, NN, TO, VB]. Although we can impose the same constraints using a syllable-level POS template (e.g., [NN, NN, NN, NN, VBD, JJ, DT, JJ, JJ, NN, TO, VB]), note that doing so reduces the expressiveness of our syllable-level model by up to a factor of 2^(l–1) for a sequence of length l insofar as we limit the model’s ability to determine the numbers of syllables per word and words per phrase. One solution is to place POS constraints at only a subset of positions (e.g., [NN, , , NN, VBD, JJ, DT, JJ, , NN, TO, VB]), allowing the model to fill in the gaps. This solution, however, still severely restricts the model’s expressiveness. It also requires statically defining a sequence of POS tags allowed at each position, essentially limiting the number of allowable POS sequences to one.

Floating dependencies solve this problem by providing a means of imposing word-level POS templates without sacrificing expressiveness and with the ability to allow multiple valid POS sequences. In a syllable-level d-order NHMM, each transition matrix Mi essentially encodes all possible (d + 1)-length syllable sequences that could result from assignments to the variables Xi and Xi+1. Associated with each syllable is the POS of the word from which the syllable was derived, enabling extraction of a (d + 1)-length syllable-level POS sequence (e.g., the syllable-level POS sequence for the syllable sequence “lla·ma wear·ing pol·ka-dot pa·ja·mas” would be [NN,NN,VBG,VBG,NN,NN,NN,NNS,NNS,NNS]). Implemented as a floating dependency (Xi, Xi+1, ρ), each such (d + 1)-length syllable-level POS sequence can be tested against a word-level POS template (e.g., ρ = {(x, y) | the syllable-level POS sequence represented by the tuples x and y must equal [NN,VBG,NN(S)] after condensing consecutive repeat tags}). In this way word-level POS constraints can be imposed in syllable-level NHMMs without specifying a precise position where they must be satisfied. Furthermore, ρ can be made to allow multiple POS templates (see Figure 3). In this way syllable-level sequences can be validated using many different word-level lexical analyses (e.g., noun phrase classifiers).
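The condensing test can be sketched directly. This is our own illustration using the example from the text; the paper's "NN(S)" shorthand (NN or NNS) is expanded here into an explicit list of allowed templates:

```python
def condense(tags):
    """Collapse runs of identical consecutive POS tags, mapping a
    syllable-level POS sequence down to a word-level one."""
    out = []
    for t in tags:
        if not out or out[-1] != t:
            out.append(t)
    return out

def satisfies_template(syllable_pos, templates):
    """Floating word-level POS constraint: the condensed syllable-level
    sequence must equal one of the allowed word-level templates."""
    return condense(syllable_pos) in templates

# Syllable-level tags for "lla-ma wear-ing pol-ka-dot pa-ja-mas"
syl = ["NN", "NN", "VBG", "VBG", "NN", "NN", "NN", "NNS", "NNS", "NNS"]
print(condense(syl))                                          # ['NN', 'VBG', 'NN', 'NNS']
print(satisfies_template(syl, [["NN", "VBG", "NN", "NNS"],
                               ["NN", "VBG", "NN", "NN"]]))   # True
```

Allowing several templates in the list is what permits multiple valid word-level analyses, as with the syntax tree of Figure 3.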

Figure 3

Floating syntax constraints. Shown are two 10-syllable phrases representing a transition between two 9-length syllable tuples in a 9th-order NHMM, each with its syllable-level POS template. The tree represents a floating word-level POS template constraint. Each path through the tree represents a POS sequence that is valid per the constraint. A transition is pruned from the NHMM if its syllable-level template, when identical consecutive tags are condensed, does not have a valid path through the tree. This is a floating dependency because the POS tags from the constraint are not imposed on specific positions in the syllable-level template. Thus despite having different syllable-level POS templates, both phrases satisfy the constraint via the same path (grey).

### 2.4 Applied examples

In each of the three following examples we demonstrate a unique advantage afforded by dynamic and/or floating dependencies in NHMMs.

#### 2.4.1 Dynamic relational constraints in lyrics

We demonstrate the use of dynamic dependencies in the context of a lyric-generation problem which we call the Down by the Bay (DBTB) problem. “Down by the Bay” is a traditional song requiring improvisation of novel lyrics which conform with rhythmic, rhyming, and POS patterns representative of the song’s characteristic structure.

After careful consideration of several prototypical solutions to the DBTB problem, we found that solutions generally followed one of 5 rhythmic templates (i.e., sequences of stressed [1] and unstressed [0] syllables): [011001], [0110101], [01010010], [01101011], or [01010101010]. Because there are variable satisfactory lengths and the length l for an NHMM must be fixed a priori, we create a syllable-level d-order NHMM for each rhythmic template r = (r1, …, rn)—where d equals one less than the length of the maximal subsequence of r that starts and ends with 1, i.e., 4 ≤ d ≤ 8—with length l = n – d + 1 (e.g., see Table 1). For each model we extracted a set of constraints on the sequence X = (X1, …, Xl) to be generated for solving the DBTB problem as follows:

1. (X1,{x | x is a d-length syllable tuple whose first syllable is the word ‘a’ or ‘an’})
2. (Xl, {x | x is a d-length syllable tuple whose last syllable is the final syllable in a word})
3. For i ∈ [1, l], (Xi, {x | x is a d-length syllable tuple whose jth syllable is stressed if ri+j–1= 1 and unstressed if ri+j–1 = 0})
4. (X1, Xl, {(x, y) | x and y are d-length syllable tuples and the 2nd syllable of x rhymes with the dth (if rn = 1) or (d – 1)th (if rn = 0) syllable of y})
5. If rn = 0, (X1, Xl, {(x, y) | x and y are d-length syllable tuples and the 3rd syllable of x rhymes with the dth syllable of y})
6. (X2, X3, {(x, y) | the syllable-level POS sequence associated with the syllable sequence derived from the d-length syllable tuples x and y must satisfy the requirements for a noun phrase})
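The unary constraints 1–3 can be sketched as predicates over d-length syllable tuples. The (text, stress, is_word_final) representation and all names below are our own hypothetical illustration, with stress values following the template rather than natural prosody:

```python
# Hypothetical syllable representation: (text, stress, is_word_final)
def starts_with_article(x):            # constraint 1, applied to X_1
    return x[0][0] in ("a", "an")

def ends_at_word_boundary(x):          # constraint 2, applied to the last variable
    return x[-1][2]

def matches_rhythm(x, r, i):           # constraint 3, one instance per position i
    # jth syllable of the tuple at 1-based position i must carry stress r_{i+j-1}
    return all(syl[1] == r[i + j - 2] for j, syl in enumerate(x, start=1))

r = [0, 1, 1, 0, 0, 1]  # template [011001], so d = 4
x1 = [("a", 0, True), ("dish", 1, True), ("of", 1, True), ("pick", 0, False)]
print(starts_with_article(x1), matches_rhythm(x1, r, 1))  # True True
print(ends_at_word_boundary(x1))                          # False ("pick" is mid-word)
```

In the NHMM each predicate is enforced, as in Section 2, by zeroing the states or transitions that fail it.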

Table 1

d-order NHMMs with Floating and Dynamic Constraints for Solving the DBTB Problem.

|  | NHMM 1 | NHMM 2 | NHMM 3 | NHMM 4 | NHMM 5 |
|---|---|---|---|---|---|
| NHMM Order (d) | 4 | 5 | 5 | 6 | 8 |
| NHMM Length (l) | 3 | 3 | 4 | 3 | 4 |
| Syllable Seq Length (n) | 6 | 7 | 8 | 8 | 11 |
| Rhythmic Template (r) | [011001] | [0110101] | [01010010] | [01101011] | [01010101010] |
| Training Sentences | 3,892,039 | 2,654,884 | 2,654,884 | 2,051,040 | 1,239,850 |
| Training Pronunciations | 366,062,046 | 286,075,704 | 286,075,704 | 255,086,072 | 208,330,754 |
| Solutions Generated | 30 | 5 | 5 | 4 | Not Satisfiable |
| Generated Example | “a dish of pickled fish” | “a cot and a chamber pot” | “a pillar that was a mirror” | “a mouse or a rat in the house” | n/a |
| Average Novelty | 3.65 | 3.66 | 4.13 | 3.40 | n/a |
| Average Rhyme | 4.18 | 4.21 | 1.94 | 4.00 | n/a |
| Average Rhythm | 3.12 | 3.30 | 2.13 | 3.16 | n/a |
| Average Amusement | 2.53 | 2.39 | 2.06 | 2.48 | n/a |
| Average Likability | 2.51 | 2.66 | 2.17 | 2.82 | n/a |

Note that dependencies 4 and 5 are examples of dynamic dependencies spanning the rhyming positions. Dependency 6 is an example of a floating dependency that uses a syntax tree with POS templates derived from traditional DBTB solutions (see Figure 3).

As training data for our DBTB models we used the Corpus of Contemporary American English (COCA) (Davies, 2009). To improve parsing we selected only complete phrases (as delimited by any non-alphabetic or apostrophe characters) which had a maximum of 30 space-delimited word tokens. Sentence pronunciation was inferred using word pronunciations from the CMU dictionary5 and the CMUSphinx grapheme-to-phoneme converter (Walker et al. 2004). Models were trained on each unique pronunciation of the sentence, with each pronunciation being down-weighted in the model proportional to the number of pronunciations for the sentence.

We trained five NHMMs, one for each of the rhythmic templates identified above. All but one of the five trained models found satisfying solutions (see Table 1). Each model was trained on sentences with syllable sequence length ls ∈ [d,30]. The system, implemented in Java 1.8, trained with 24 cores and 256 GB of RAM for 5 hours and 43 minutes.

To evaluate the effectiveness of higher-order NHMMs, we performed a quantitative assessment of subjective reactions to the DBTB solutions generated by our models. We obtained 62 human-generated solutions by inviting university faculty and students to “come up with your own novel ending”. Of these we sampled a random subset to obtain an equal number of computer-generated and human generated solutions (44 of each). We conducted 94 surveys in which participants were asked to rate five DBTB solutions (randomly and evenly sampled from our mixed pool) on Likert scales for novelty, rhyme, rhythm, amusement, and overall likability (each solution thus being rated an average of 5.34 times). Results are shown in Figure 4 and Table 1.

Figure 4

Qualitative evaluation. Results of 470 survey responses rating human- and computer-generated solutions to the DBTB problem. Error bars represent standard error.

The top five highest-scoring computer-generated solutions, their scores (averaged over the 5 criteria), and the NHMM that created them were:

• “a way out of the bay” – 4.04 – NHMM 1
• “a place in Chevy Chase” – 3.85 – NHMM 1
• “a sign of the decline” – 3.80 – NHMM 1
• “a scar shaped like a star” – 3.75 – NHMM 1
• “a stream that wound like a dream” – 3.74 – NHMM 2

The five median-scoring computer-generated solutions were:

• “a boy with a new toy” – 3.16 – NHMM 1
• “a tree where I could see” – 3.06 – NHMM 1
• “a thread from the side of his head” – 3.04 – NHMM 4
• “a sheet over the seat” – 3.03 – NHMM 1
• “a sty in her left eye” – 3.00 – NHMM 1

The five lowest-scoring computer-generated solutions were:

• “a pillar that was a mirror” – 2.7 – NHMM 3
• “a man who trained and ran” – 2.67 – NHMM 1
• “a ticket at the last minute” – 2.51 – NHMM 3
• “a comer of the room mother” – 2.25 – NHMM 3
• “a vision of the ambition” – 2.24 – NHMM 3

We found that the novelty scores of computer-generated solutions were competitive with those of solutions generated by humans. The rhyme scores were also very competitive, with the difference being partially explained by our failure to consider the matching of syllable onsets in multisyllabic words (e.g., “pillar”, “mirror”) in NHMM 3. We hypothesize that the difference observed in the rhythm score may be due to the confounding variables of rhyme and grammaticality. For example, the poor rhyming performance of NHMM 3 (the only model with multisyllabic rhymes) may have carried over into the ratings for other aspects of NHMM 3 solutions. Likewise it seems a reasonable hypothesis that, although poor grammar does not affect rhythm, it may nonetheless be perceived to affect the rhythm; however, we did not include grammaticality in our assessment. The amusement and likability scores, with which the computer struggled most, may have been affected by the genre-appropriateness of some of the computer-generated solutions (e.g., “a flood of spurting blood”, “an orphan and an abortion”).

#### 2.4.2 Floating semantic constraints in haiku

To demonstrate the generality of floating dependencies, we designed two different syllable-level NHMMs to generate haiku with floating semantic dependencies. The first was a 5th-order NHMM of length 13 with a nature-themed floating semantic constraint applied over the first 5 syllables. The second was a 4th-order NHMM of length 14. In this model we applied a floating word-level POS template constraint (similar to those described for the DBTB models) over the last 5 syllables of each line. Each line-specific constraint had a syntax tree populated with word-level POS templates parsed from corresponding lines in a database of existing haiku.

We imposed a beauty/earth-themed floating semantic constraint over the last 4 syllables using word2vec (Mikolov et al., 2013) to ascertain semantic relatedness between words. In both models, each line was constrained to start and end with word-starting and word-ending syllables. Figure 5 shows an example of manually selected haiku generated from each model. The haiku from the first model was chosen from among 7 randomly generated haiku (all 7 represented exact sequences from the training data). The haiku from the second model was chosen from among 14 randomly generated haiku (only 1 out of 14 from the second model was an exact sequence from the training data). We again note that solutions exist to prevent plagiarism of exact sequences from the training data, but we did not employ such methods in these models (Papadopoulos et al., 2014).

Figure 5

Haikus. These haikus are generated from syllable-level NHMMs with floating dependencies. (Left) A haiku found using a 5th-order NHMM with a nature-themed floating semantic constraint. (Right) An original haiku generated from a 4th-order NHMM with floating word-level POS template constraints and a beauty/earth-themed floating semantic constraint.

#### 2.4.3 Floating stress constraints in prosody

For generating prosodic rhythm, we designed a 4th-order NHMM over rhythm tokens of length 6 to generate rhythms for the lyrics “No more monkeys jumping on the bed”. We trained the model on lyric/rhythm sequences from the Wikifonia dataset. Given the rhythmic stress template of the lyrics, [111010101], the model imposes floating stress constraints requiring only two thirds of syllable stresses to match appropriately to emphasized rhythmic positions (one constraint over the first five syllables and another over the last four). The model has 4/4 floating time signature constraints at all positions (see Figure 6).

Figure 6

Prosodic rhythm for lyrics. Given the lyric “No more monkeys jumping on the bed!”, we used a 4th-order NHMM over rhythm tokens to generate prosodic rhythms like those shown here. Stressed syllables are bold and notes in emphasized rhythmic positions are in parentheses.

In the first example in Figure 6, a rhythm is generated which places all six stressed syllables on emphasized rhythmic positions. However, in the second example, only four of the six stressed syllables align with emphasized rhythmic positions (the minimum required to satisfy the floating constraint). As a result, two stressed syllables fall on offbeats, creating a syncopated effect with accents in unstressed bar positions. This example demonstrates that not only can the model generate rhythms to complement arbitrary lyrical sequences, but that it is capable of meeting constraints while being given some creative autonomy to break rules or constraints to some controlled degree. In a similar fashion, floating constraints might be used to allow autonomous music generation systems to create lyrical, melodic, harmonic, or rhythmic variations on a motif, enabling generation of songs that adhere to broad forms of structural repetition but with nuances at each repetition.
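A floating stress constraint of this kind reduces to a ratio test over a window of syllable positions. The sketch below is illustrative (the 2/3 threshold and window boundaries come from the text; the function and its inputs are assumptions, not the paper's implementation):

```python
def satisfies_floating_stress(stresses, emphasized, lo, hi, min_fraction=2/3):
    """Check a floating stress constraint over syllable positions [lo, hi):
    at least min_fraction of the stressed syllables in the window must
    fall on emphasized rhythmic positions."""
    stressed = [i for i in range(lo, hi) if stresses[i] == 1]
    if not stressed:
        return True
    matched = sum(1 for i in stressed if emphasized[i])
    return matched / len(stressed) >= min_fraction

# Stress template for "No more monkeys jumping on the bed"
stresses = [1, 1, 1, 0, 1, 0, 1, 0, 1]
# A hypothetical sampled rhythm placing five of the six stressed
# syllables on emphasized rhythmic positions
emphasized = [True, False, True, True, True, False, True, False, True]
print(satisfies_floating_stress(stresses, emphasized, 0, 5))  # first five syllables
print(satisfies_floating_stress(stresses, emphasized, 5, 9))  # last four syllables
```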

## 3 Generating music with steerable long-range dependencies

Most motifs and dependencies in music occur across ranges far longer than any reasonable Markov order can capture. To address this problem we demonstrate a new method for modeling long-range dependencies with NHMMs. To our knowledge, this is the first method demonstrated to generate compositions according to Markov probabilities, ensuring local cohesion throughout the sequence, while imposing long-range, dynamic dependencies that allow for widespread motifs and repetitions. This approach places no limit on the range of a dependency, and the Markov order can be chosen without affecting the ability to impose long-range dependencies. We demonstrate the abilities of this model by generating sheet-music compositions with verse-chorus structures, including repeating pitch, rhythm, harmonic, and lyric motifs.

An important key to generating music with long-range dependencies is the recognition that n-ary relational constraints can, in fact, be modeled using regular constraints. This may at first appear to be at odds with computational theory. Consider, for example, that the set of all palindromes—a quintessential example of a relational constraint—is not a regular language. However—and this is the key—the set of all palindromes of a particular, finite length is a regular language. Thus given a finite length of the composition to be generated, we can model long-range dependencies using regular constraints. As any regular language can be recognized/generated by a deterministic finite automaton (DFA), we take the approach of creating a DFA that incorporates all long-range dependencies and then combine this DFA with an NHMM in order to ensure that sequences recognized by the DFA are sampled with the correct probability.
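The key observation rests on the fact that every finite language is regular. A quick illustrative check (the alphabet and length are chosen arbitrarily for demonstration):

```python
from itertools import product

def palindromes(alphabet, n):
    """Enumerate all palindromes of exactly length n over a finite alphabet.
    The set is finite (|alphabet|**ceil(n/2) strings), hence regular: a DFA
    could accept exactly this set, e.g., as a trie over these strings."""
    half = (n + 1) // 2
    result = set()
    for prefix in product(alphabet, repeat=half):
        # mirror the first half to complete the palindrome
        suffix = prefix[: n // 2][::-1]
        result.add("".join(prefix + suffix))
    return result

pals = palindromes("ab", 4)
print(sorted(pals))  # ['aaaa', 'abba', 'baab', 'bbbb']
print(len(pals) == 2 ** 2)
```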

### 3.1 Automata for relational constraints

The exposition of the method for imposing long-range dependencies relies on the definitions of relational constraints, Markov models, and NHMMs provided in Section 2.1. In addition, we define a DFA as a quintuple A = ⟨Q, Σ, δ, q0, F⟩ in which:

• Q is a finite set of states;
• Σ is a finite set of symbols termed the alphabet;
• δ is the transition function Q × Σ → Q, mapping a state to a successor state for a given symbol;
• q0 ∈ Q is the initial or start state of the automaton;
• F ⊆ Q is the set of final or accept states.

A sequence s = (s1, …, sn) is accepted by A iff there exists a sequence q0, …, qn of states such that ∀i ∈ [1,n], δ(qi–1,si) = qi and qn ∈ F. The language L(A) is the set of all sequences which A accepts.
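This acceptance condition translates directly into code. The following generic DFA sketch (an illustration, not the paper's implementation) models missing transitions as an implicit dead state:

```python
class DFA:
    """Deterministic finite automaton A = (Q, Sigma, delta, q0, F).
    delta is a dict mapping (state, symbol) -> state; missing entries
    model a dead state from which acceptance is impossible."""

    def __init__(self, delta, q0, accept):
        self.delta = delta
        self.q0 = q0
        self.accept = set(accept)

    def accepts(self, sequence):
        q = self.q0
        for symbol in sequence:
            if (q, symbol) not in self.delta:
                return False
            q = self.delta[(q, symbol)]
        return q in self.accept

# A toy automaton over {a, b} accepting sequences ending in b
delta = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}
A = DFA(delta, q0=0, accept={1})
print(A.accepts("aab"))  # True
print(A.accepts("aba"))  # False
```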

Note that when combining a DFA and Markov model, the domain Σ for variables X1, …, Xn is the same as the alphabet Σ for the DFA.

Given a set of binary relational constraints C = {(Xi,Xj,ρ)}, a set of valid start symbols I ⊆ Σ, a set of valid transitions T = {(si,sj) | si ∈ Σ and sj ∈ Σ}, and a length n, Algorithm 1 creates a RELATIONAL automaton, i.e., a DFA A such that L(A) is the set of all sequences s ∈ Σn such that s1 ∈ I; (si–1,si) ∈ T for i ∈ [2,n]; and (sj,sk) ∈ ρ for all relational constraints (Xj,Xk,ρ) in C. Where I and T are determined from PI and PT of a Markov model M, this means PM(s) > 0.

The algorithm takes the general approach of building breadthwise a tree-like DFA where each layer (of edges) in the tree represents a position in the sequence to be generated. Each state qi in the DFA is associated with one of the variables Xj and is defined by a triple ⟨Cqi,Vqi,qi⟩ where Cqi is a set of relational constraints representing dependencies whose range includes position j; Vqi ⊆ Σ is the set of valid edge labels allowed by T for edges leaving the state; and qi is a reference to the state itself. Whereas a binary constraint in C relates two variables, a constraint (a,Xk) in Cqi relates an element a ∈ Σ to a variable Xk.

When expanding the DFA from a state ⟨Cqi,Vqi,qi⟩ associated with Xj, a new path is considered for every label v ∈ Vqi. In order to build the path, the algorithm first computes what would be the new triple ⟨Cq′,Vq′,q′⟩ for the new state reached via v in order to check whether a state with the same constraint and edge-label sets already exists. The new set Cq′ inherits all binary constraints from Cqi except those relating an element a ∈ Σ to the variable Xj. For each constraint (Xj,Xk) ∈ C, we also add a constraint (v,Xk) to Cq′. The new set Vq′ is populated with the set of all labels v2 such that 1) vv2 is a valid transition according to T and 2) (a,v2) ∈ ρ for every constraint (a,Xj+1,ρ) in Cq′. In practice, Vq′ can be further constrained by unary constraints of the form (Xj+1,ρ) to further prune undesirable sequences. Prior to adding q′, and assuming Vq′ is not empty (an empty set signals a dead state), we check whether a state associated with sequence position j + 1 with the same constraint set Cq′ and edge label set Vq′ already exists. In the last state layer there is a single accepting state to which all states in the previous state layer (i.e., those states associated with the last variable, Xn) connect via all valid labels in their respective edge label sets.
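A simplified sketch of this layer-by-layer construction may help fix ideas. It departs from Algorithm 1 in several respects (states are merged by pending partially instantiated constraints and last emitted label rather than by the edge-label set Vq, and the paper's pruning optimizations are omitted); all names and data structures are illustrative:

```python
def build_relational_automaton(n, constraints, I, T, relations):
    """Layer-by-layer construction in the spirit of Algorithm 1.
    constraints: list of (i, j, rho_name), 1-based positions with i < j <= n.
    relations:   dict rho_name -> set of allowed (value_i, value_j) pairs.
    I: valid start symbols; T: set of valid transitions (a, b).
    States are (layer, pending, last) where pending is a frozenset of
    partially instantiated constraints (value, target_position, rho_name)."""

    def step(layer, pending, v):
        # reject v if it violates a pending constraint targeting this layer
        for (a, k, rho) in pending:
            if k == layer and (a, v) not in relations[rho]:
                return None
        # keep unresolved constraints; instantiate those opening here
        new_pending = frozenset(
            {(a, k, rho) for (a, k, rho) in pending if k > layer}
            | {(v, j, rho) for (i, j, rho) in constraints if i == layer}
        )
        return (layer, new_pending, v)

    start = (0, frozenset(), None)
    delta, frontier = {}, {start}
    for layer in range(1, n + 1):
        next_frontier = set()
        for state in frontier:
            _, pending, last = state
            options = I if last is None else {b for (a, b) in T if a == last}
            for v in options:
                nxt = step(layer, pending, v)
                if nxt is not None:  # otherwise v leads to a dead state
                    delta[(state, v)] = nxt
                    next_frontier.add(nxt)
        frontier = next_frontier
    # all constraints are resolved by layer n, so surviving states accept
    return delta, start, frontier

def run(delta, start, accepts, sequence):
    """Walk the automaton; a missing transition means rejection."""
    q = start
    for v in sequence:
        q = delta.get((q, v))
        if q is None:
            return False
    return q in accepts

# A toy version of the rhyme example: position 4 must rhyme with position 1
delta, start, accepts = build_relational_automaton(
    n=4,
    constraints=[(1, 4, "rhyme")],
    I={"Mary", "Clay"},
    T={("Mary", "had"), ("Clay", "had"), ("had", "a"),
       ("a", "Carey"), ("a", "day")},
    relations={"rhyme": {("Mary", "Carey"), ("Clay", "day")}},
)
print(run(delta, start, accepts, ["Mary", "had", "a", "Carey"]))  # True
print(run(delta, start, accepts, ["Mary", "had", "a", "day"]))    # False
```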

Figure 7 shows the DFA built using Algorithm 1 with n = 4; C = {(X1,X4,ρrhyme)} (where ρrhyme represents the set of rhyming word pairs); I = {Mary, Clay}; and T derived from the non-zero transitions represented in the Markov model shown in Figure 8.

Figure 7

A RELATIONAL automaton. The result of Algorithm 1 on inputs n = 4; C = {(X1,X4,ρ)} (where ρ represents the set of all rhyming word pairs); I = {Mary, Clay}; and T derived from the non-zero transitions represented in the Markov model shown in Figure 8. Dead states and paths have been removed.

Figure 8

A Markov model. The model shown is a modified version of the model exemplified by Barbieri et al. (2012), to whom we pay tribute for having in large measure inspired this work. Missing is the initial probability distribution for the model, which is PI = {Mary: 0.5, Clay: 0.5}.

The time and space requirements for the algorithm vary significantly depending on the inputs. Each constraint in C at a new position splits the path into a number of paths dependent on the number and distribution of transitions in T and the restrictiveness of the constraint’s relation ρ. Divergent paths reconverge once all overlapping constraints have been resolved. The number of paths and states can also be reduced by adding unary constraints (including negated relational constraints), increasing the number of symbols per label (i.e., increasing the Markov order), and reducing the size of the transition set T.

One way to optimize the construction of a RELATIONAL automaton is to take advantage of binary relational constraints for which ρ is an equivalence relation (i.e., reflexive, symmetric, and transitive). In such cases many partially instantiated constraints become equivalent. For example, under the equivalence relation of rhyming (and allowing for a word to rhyme with itself), the following partially instantiated binary relational constraints are equivalent: (Mary,X4,ρ), (Fairy,X4,ρ), (Carey,X4,ρ). These constraints can all be represented using a generalized constraint that uses the equivalence class EMary representing the set of all words that rhyme with “Mary”: (EMary,X4,ρ). This facilitates the combining of states in the DFA.
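One way to realize this merging is to canonicalize each partially instantiated constraint by replacing its instantiated value with an equivalence-class identifier. The sketch below is illustrative; the class table is a hypothetical fragment, not data from the paper:

```python
def canonicalize(constraint, equivalence_classes):
    """Replace the instantiated value in a partially instantiated constraint
    (value, target_var, rho_name) by its equivalence class under rho, so
    states whose constraints differ only by equivalent values can merge."""
    value, target, rho = constraint
    for class_id, members in equivalence_classes[rho].items():
        if value in members:
            return (class_id, target, rho)
    return constraint

# Hypothetical rhyme classes: all words rhyming with "Mary" share one class
classes = {"rhyme": {"E_Mary": {"Mary", "Fairy", "Carey"},
                     "E_Clay": {"Clay", "day", "way"}}}
print(canonicalize(("Fairy", 4, "rhyme"), classes))  # ('E_Mary', 4, 'rhyme')
print(canonicalize(("Carey", 4, "rhyme"), classes) ==
      canonicalize(("Mary", 4, "rhyme"), classes))   # True
```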

### 3.2 An NHMM with regular constraints

Given a RELATIONAL automaton A, a Markov model M, a set of any additional unary constraints U, and a length n we can construct an NHMM which samples sequences from the target distribution P* defined as:

$P^*(X_1, \dots, X_n) \propto \begin{cases} P_M(X_1, \dots, X_n) \cdot \prod_{i=1}^{n} P_U(X_i) & \text{if } (X_1, \dots, X_n) \in L(A) \\ 0 & \text{otherwise} \end{cases}$

where for sequences (X1, …, Xn) which fail to satisfy U the term PU(Xi) = 0 for some i ∈ [1,n], resulting in P*(X1, …, Xn) = 0. It is more efficient to apply constraints in U when creating A because doing so further limits the size of the tree. We choose not to demonstrate this here for three reasons. First, the primary purpose in presenting the RELATIONAL automaton is its ability to enforce non-unary constraints, and our aim is to keep the explication of its construction as clear and simple as possible. Second, the incorporation of unary constraints into Algorithm 1 can be intuited relatively easily: for each unary constraint (Xj,ρ), prune non-satisfying elements from the edge label set Vqi for states qi associated with Xj. Third, the construction of an NHMM presented by Pachet et al. (2011) by default assumes as input a set of unary constraints, giving us an already-defined method for dealing with these types of constraints.

As automata lack probabilities, we use the RELATIONAL automaton A to construct an NHMM as follows. Given an automaton A = ⟨Q, Σ, δ, q0, F⟩, a Markov model M = ⟨Σ, PI, PT⟩ (the same as that used to construct A), a set of unary constraints U (partitionable into subsets U1, …, Un such that subset Ui represents unary constraints applying to Xi), and a length n, Algorithm 2 builds a new “state-sensitive” Markov model M′ which incorporates the relational constraints represented by the regular automaton A (see Figure 9). Each Markov label in M′ represents a label-state pair ⟨x,q⟩ where x ∈ Σ is a label of the original Markov model M and q ∈ Q is a state of the automaton A. Whereas with M we can directly sample values x1, …, xn for a sequence of random variables X = (X1, …, Xn), in the “state-sensitive” model M′ we first sample values x′1, …, x′n for a sequence of random label-state variables X′ = (X′1, …, X′n). We obtain X through the assignment X = (x′1.label, …, x′n.label), where x′i.label is the value of the label in the label-state pair x′i.

Figure 9

A “state-sensitive” pseudo-Markov model. This is the model M′ built using Algorithm 2 given as inputs the automaton in Figure 7, the Markov model in Figure 8, an empty unary constraint set, and a length n = 4. This is a “pseudo”-Markov model because, given this approach, probabilities must remain unnormalized for proper construction of the NHMM.

The set of initial probabilities ${P}_{I}^{\prime }$ of M′ is defined as

$P'_I(\langle x, q \rangle) \propto \begin{cases} P_I(x) & \text{if } q = \delta(q_0, x) \\ 0 & \text{otherwise} \end{cases}$

and the transition probability function ${P}_{T}^{\prime }$ of M′ is defined as

$P'_T(\langle x', q' \rangle \mid \langle x, q \rangle) \propto \begin{cases} P_T(x' \mid x) & \text{if } q' = \delta(q, x') \\ 0 & \text{otherwise} \end{cases}$

A new set of unary “state-sensitive” constraints U′ is also created such that for each unary constraint (Xi,ρ) ∈ U, an equivalent constraint (X′i, {x′ | x′.label ∈ ρ}) is added to U′. An accepting constraint (X′n, {x′ | x′.state ∈ F}) is also added to U′ to ensure that all valid Markov sequences from M′ end with a label-state pair x′ = ⟨x,q⟩ where q ∈ F.
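The core step of this construction is a product of the Markov model with the automaton. The following is an illustrative, unnormalized sketch of that step only (the dict-based shapes of `delta` and the probability tables are assumptions, and normalization/arc-consistency via the NHMM construction is omitted):

```python
def state_sensitive_model(P_I, P_T, delta, q0):
    """Build the unnormalized 'state-sensitive' model M' over label-state
    pairs <x, q>. P_I: dict label -> prob; P_T: dict label -> {label: prob};
    delta: dict (state, label) -> state. Zero-probability entries omitted."""
    # a pair <x, q> means label x was emitted and the automaton landed in q
    pairs = {(x, delta[(q, x)]) for (q, x) in delta}

    # P'_I(<x, q>) = P_I(x) if q = delta(q0, x), else 0
    P_I_prime = {(x, delta[(q0, x)]): p
                 for x, p in P_I.items() if (q0, x) in delta and p > 0}

    # P'_T(<x2, q2> | <x, q>) = P_T(x2 | x) if q2 = delta(q, x2), else 0
    P_T_prime = {}
    for (x, q) in pairs:
        for x2, p in P_T.get(x, {}).items():
            if (q, x2) in delta and p > 0:
                P_T_prime[((x, q), (x2, delta[(q, x2)]))] = p
    return P_I_prime, P_T_prime

# Toy automaton for the language a+b (final state 2) with a 1st-order model
delta = {(0, "a"): 1, (1, "a"): 1, (1, "b"): 2}
P_I = {"a": 1.0, "b": 0.0}
P_T = {"a": {"a": 0.5, "b": 0.5}}
print(state_sensitive_model(P_I, P_T, delta, q0=0))
```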

The final step of Algorithm 2 constructs an NHMM from the unnormalized M′, U′, and n using the construction set forth by Pachet et al. (2011) which enforces arc-consistency and normalizes probabilities to create the target distribution P*.

#### 3.2.1 Comparing NHMMs and Factor Graphs

Papadopoulos et al. (2015) demonstrate a mechanism for computing P* that uses factor graphs with belief propagation for sampling sequences with exact probabilities according to the original distribution. In their solution a backward phase computes the impact on each sequential position i in the factor graph of the sub-factor graph representing all positions “to the right” of i. A forward phase computes the marginal distribution over values for each sequence position i given the partial instantiation for variables representing positions “to the left” of i and simultaneously samples a sequence. A potential drawback to this solution is that though the backward phase is completed once, the forward phase is repeated each time a sequence is sampled.

The approach we present, shown in Algorithm 2, samples sequences under regular and Markov constraints with exact probability using an NHMM. An NHMM computes all marginal distributions ahead of time, speeding up sampling time at the cost of an increased build time (see Figure 10). This makes it a better choice for online sampling or when sampling larger sets of sequences. For a detailed explanation of NHMMs and the method of their construction (referenced as “constructNHMM()” in Algorithm 2) we refer the reader to Pachet et al. (2011).
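This trade-off can be summarized with simple amortized-cost arithmetic. The numbers below are illustrative only, not measurements from Figure 10:

```python
def amortized_sample_time(build_time, per_sample_time, num_samples):
    """Amortized cost per sequence: the one-time build (or backward-phase)
    cost spread over all samples, plus the per-sample cost."""
    return build_time / num_samples + per_sample_time

# Illustrative numbers: a model with a large build cost but fast sampling
# overtakes a cheap-to-build, slower-sampling model once enough sequences
# are drawn; for few samples the cheap build wins
fast_sampler = amortized_sample_time(build_time=10.0, per_sample_time=0.001,
                                     num_samples=100_000)
fast_builder = amortized_sample_time(build_time=0.1, per_sample_time=0.004,
                                     num_samples=100_000)
print(fast_sampler < fast_builder)  # True
```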

Figure 10

Sample Time By Length. Shown are average sample times for the NHMM (blue) and factor graph (orange) from sampling 100,000 sequences belonging to the set {aa+b+}. Both sample times increase linearly with the sequence length. Though the sample time per sequence is always lower for the NHMM, the NHMM build time also increases with sequence length resulting in a lower amortized sample time (dotted lines) for factor graphs as the sequence length increases.

## 4. Example compositions with long-range dependencies

RELATIONAL automata are effective tools for imposing both Markovian transitional constraints (local, horizontal structure) and relational constraints (global, long-range structure). This is well-demonstrated in using binary relational constraints to constrain parallel interrelated sequences, as we show in the following example of generating a musical sequence with global verse-chorus structure.

Verse-chorus structure is effected in lyrical sectional-form music when subsequences of different types or viewpoints (e.g., chords, pitches, rhythms, and lyrics) are repeated at intervals throughout the musical sequence. For example, a verse is generally regarded as a segment in which the chords, pitches, and rhythms are repeated but the lyrics differ, while a chorus is generally regarded as a segment in which all four of those viewpoints are repeated.

Repetition of subsequences requires the use of a series of binary relational constraints using the equality binary relation. Rather than hand-craft a set of relational constraints Cv for each viewpoint v, we learn existing patterns of repetition from normalized data and then translate these patterns into sets of binary relational constraints. For each of four viewpoints (chords (H), rhythm (R), lyrics (L), and pitch (P)) we perform multi-Smith-Waterman (mSW) self-alignment on the discretized musical sequence of an existing composition, using parameterizations reported in Bodily and Ventura (2021). An example of structural patterns learned from the song Twinkle, Twinkle, Little Star is shown in Figure 11. For each viewpoint v ∈ {H,R,L,P} the n × n alignment matrix (where n is the length of the song) identifies matching regions (signified by dark red diagonals) in viewpoint v of the training song. For a non-zero entry in row i, column j of the matrix we extract a binary relational constraint (Xi,Xj,{(x,y) | x = y}).
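Translating non-zero alignment entries into equality constraints might look like the following sketch. The mSW alignment itself is assumed precomputed (here represented as a dense matrix); the toy matrix is illustrative, not data from Figure 11:

```python
def extract_equality_constraints(alignment):
    """Turn non-zero entries of an n x n self-alignment matrix into binary
    equality constraints (i, j): the events at positions i and j must match.
    Only the upper triangle is used to avoid duplicates and self-matches."""
    n = len(alignment)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if alignment[i][j] != 0]

# Toy 4-event viewpoint in which events 0/2 and 1/3 align (a repeated pair)
alignment = [
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]
print(extract_equality_constraints(alignment))  # [(0, 2), (1, 3)]
```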

Figure 11

Inferring Relational Constraints. Relational constraint sets are inferred from real data using multiple-Smith-Waterman sequence alignment. Shown are the structural patterns inferred for the chord, pitch, rhythm, and lyric sequences in Twinkle, Twinkle, Little Star. The minute textual labels on the axes in each graph are the lyrics to the song and are merely a graphical reminder of what the axes represent.

In addition to the global, long-range structure imposed by these binary relational constraints, we added the following constraints:

Chord. Because binary relational constraints are designed to enforce structural repetition, it is not uncommon for generated sequences (particularly those from models trained on small data sets) to become overly repetitive, repeating structure even when it is not enforced. To counteract this effect in the chord sequence, we added a second set of binary relational constraints which constrained a subset of positions within regions not aligned via the mSW alignment to not match according to the equality relation. The model is constrained to generate sequences starting and ending with the same chord, namely a C major or A minor chord.

Rhythm. We add unary constraints to make the generated rhythmic sequence end with a half-note rhythm to evoke a sense of finality.

Lyric. The length (in syllables) of the lyric sequence nL is derived from the number of sampled rhythm tokens representing non-rest rhythmic onsets. Prosody (i.e., aligning stressed syllables with stressed beats) is achieved by constraining syllables for offbeat rhythms to be unstressed. Rhyming is effectuated using a second binary relational constraint set C′L = {(Xi,Xj,{(x,y) | x and y are rhyming syllables})}. C′L is constructed such that the last syllables in measures 2 and 4 rhyme and the last syllables in measures 6 and 8 rhyme.

Pitch. We constrain the last pitch in the composition to be the root of the final chord. For all positions i we constrain the pitch at i to equal one of the pitches defining the chord at position i as per the previously sampled harmonic sequence.

We tried different Markov orders for the models used to generate each of the viewpoints. We found that a Markov order of 4–5 worked well for eighth-note chord sequences, an order of 3–4 for eighth-note pitch and rhythm sequences, and an order of 4–5 for the syllable-level lyric model.

The Markov models MH and MP for the chord and pitch sequences were trained on the chord and pitch sequence data in Somewhere Over the Rainbow. The rhythmic Markov model MR was trained on rhythms from 5 different songs to give the model a rich choice of rhythmic variation. The lyric Markov model ML was trained on the lyrics from the Beatles’ Hey Jude and John Lennon’s Imagine.

Given the relational constraints Cv, the Markov model Mv, and these additional binary and unary constraints, we generate a RELATIONAL automaton Av and a REGULAR NHMM for each viewpoint v ∈ {H,R,L,P}. The chord, rhythm, lyrics, and pitch sequences shown in Figure 12 were sampled from their respective NHMMs with probabilities 4.9e-5, 7.7e-6, 0.032, and 2.0e-3 respectively. Source code for building RELATIONAL automata and REGULAR NHMMs is available at https://github.com/norkish/pop-star.

Figure 12

Composition generated with long-range dependencies. Shown are four parallel sequences (chords, pitches, rhythms, and lyrics) generated using REGULAR NHMMs that exhibit both local, horizontal structure—each fully satisfies Markovian constraints—and global, long-range structure—each fully satisfies binary relational constraints. Boxes of the same color are used to illustrate subsequences which position-by-position (e.g., eighth note by eighth note) are constrained via binary relational constraints to be equivalent. Dark red and dark yellow boxes reflect binary relational rhyming constraints. Not labeled is the pattern of rhythmic repetition every 2 measures.

To demonstrate the efficacy of the model, we generated 12 compositions at random and analyzed their long-range dependencies. Of these 12 compositions, 7 were generated to satisfy a set of 147 constraints extracted from Twinkle, Twinkle, Little Star, and 5 were generated to satisfy a set of 273 constraints extracted from Somewhere Over the Rainbow. Constraints were automatically extracted at runtime using the mSW self-alignment method described by Bodily and Ventura (2021). An example of a set of constraints extracted via this method is represented by the coloring in Figure 12. A breakdown of the types and average lengths of the long-range dependency constraints in each set appears in Table 2.

Table 2

Long-range dependency constraints satisfied in generated compositions.

|  | Constraint Set 1 (Twinkle, Twinkle, Little Star) | Constraint Set 2 (Somewhere Over the Rainbow) |
| --- | --- | --- |
| Form | Ternary (ABA) | 32-bar form (AABA) |
| Measures | 12 | 32 |
| Events per measure | 8 | 8 |
| Composition length | 96 | 256 |
| **Total constraint count** |  |  |
| Harmony | 48 | 64 |
| Rhythm | 16 | 96 |
| Lyrics | 32 | 26 |
| Pitch | 48 | 80 |
| Rhyme | 3 | 7 |
| **Average constraint range (in eighth notes)** |  |  |
| Harmony | 48 | 184 |
| Rhythm | 80 | 138 |
| Lyrics | 64 | 192 |
| Pitch | 48 | 160 |
| Rhyme | 16 | 27 |

Despite being generated from only 2 different sets of constraints, the 12 compositions are otherwise significantly varied. As shown in Figure 13, the majority of k-tuples (i.e., event subsequences) in any viewpoint are unique, even for small values of k, not to mention the added uniqueness that arises from the combination of viewpoints. The same variation is seen in how each of the 12 compositions satisfies its respective rhyming constraints. Of the 34 rhyming constraints collectively satisfied by the 12 compositions, over a dozen different rhyme groups are represented and only one rhyme is repeated (see Figure 14). The 12 compositions are available online at https://www2.cose.isu.edu/~bodipaul/research/pop_star/.
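The uniqueness measure reported in Figure 13 can be approximated as the ratio of distinct k-tuples occurring in exactly one composition. The sketch below uses toy sequences in place of the generated compositions:

```python
def unique_ktuple_ratio(sequences, k):
    """Fraction of distinct k-tuples (contiguous event subsequences) that
    occur in exactly one of the given sequences -- a simple variation measure."""
    owners = {}
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            tup = tuple(seq[start:start + k])
            owners.setdefault(tup, set()).add(idx)
    if not owners:
        return 0.0
    unique = sum(1 for s in owners.values() if len(s) == 1)
    return unique / len(owners)

# Toy chord "compositions"; any token sequences (e.g., pitches) work the same
songs = ["CFGC", "CFEC", "AFGA"]
print(unique_ktuple_ratio(songs, k=2))
```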

Figure 13

Variation in satisfying constraints. Though all 12 randomly selected compositions satisfy all specified constraints, they still remain highly varied in lyrics (blue), pitch (grey), harmony (red), and rhythm (orange). This is shown by considering that the ratio of k-tuples that are unique to one of the 12 compositions is high for even relatively small values of k. For lyrics, k is measured in words. For all other viewpoints, k is measured in eighth-note intervals (e.g., k = 8 represents one 4/4 measure). Compositions were normalized to a common key.

Figure 14

Variation in satisfying rhyme constraints. Of the 34 rhymes collectively incorporated into the 12 randomly selected compositions, all but one (94%) are unique and over a dozen different phonemic rhyming groups are represented.

## 5. Discussion and conclusion

Although for simplification we have focused exclusively on binary relations, the algorithms and logic in this approach could be adapted to allow for n-ary relations ρ and relational constraints (Xi1, …, Xin, ρ). Such adaptations are unlikely to negatively affect time- or space-complexity (except to the extent that they span longer distances) because they potentially constrain the model more heavily, thus reducing the search space significantly. These impacts and possible applications of n-ary relational constraints in Markov processes are a possible target of future research.

Of particular interest is the runtime required to construct a REGULAR NHMM. Because the RELATIONAL automaton built by Algorithm 1 essentially represents an exhaustive search tree, construction incurs an exponential runtime. While normally such a complexity might call the viability of the solution into question, we maintain that the algorithm is useful for two reasons. First, in practice the RELATIONAL automaton is rarely a full tree. The tree only expands in regions encompassed by long-range dependencies; outside of these regions, the tree collapses into a single chain. Pruning by constraints also significantly reduces the breadth of the tree throughout the search process. The fact that more constraints may be better from a runtime perspective may actually be encouraging: much future research into enhancing semantic meaning in music and lyrics is likely to evolve through constraints. Second, even for average-length pop music compositions, our experience suggests that the processing speeds of most modern computers are sufficient to generate solutions in reasonable time frames despite their increased complexity.

The solution we have presented enables probabilistic sampling of sequences that exhibit complex relational structure while maintaining natural, fully-Markovian transitions and allowing, via low Markov orders, a broad range of expression. The ability to impose long-range dependencies using relational and regular constraints represents a critical achievement for AI composition, enabling researchers to focus on how to effectively use such tools to model music at the level of musically meaningful events and patterns.

## Notes

1Source code and data available at https://github.com/norkish/pop-star.

2i.e., values for variables—even values which do not violate constraints—are pruned if such values cannot possibly be part of a solution consistent with constraints.

## Competing Interests

The authors have no competing interests to declare.

## References

1. Barbieri, G., Pachet, F., Roy, P., and Esposti, M. D. (2012). Markov constraints for generating lyrics with style. In Proceedings of the European Conference on Artificial Intelligence, pages 115–120. IOS Press.

2. Bodily, P. M. and Ventura, D. (2021). Inferring structural constraints in musical sequences via multiple self-alignment. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 43.

3. Briot, J.-P. and Pachet, F. (2020). Deep learning for music generation: Challenges and directions. Neural Computing and Applications, 32(4):981–993. DOI: https://doi.org/10.1007/978-3-319-70163-9

4. Collins, T. and Laney, R. (2017). Computer–generated stylistic compositions with long–term repetitive and phrasal structure. Journal of Creative Music Systems, 1(2). DOI: https://doi.org/10.5920/JCMS.2017.02

5. Colton, S., Gow, J., Torres, P., and Cairns, P. A. (2010). Experiments in objet trouvé browsing. In Proceedings of the International Conference on Computational Creativity, pages 238–247.

6. Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J., and Liu, R. (2019). Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.

7. Davies, M. (2009). The 385+ million word corpus of contemporary American English (1990–2008+): Design, architecture, and linguistic insights. International Journal of Corpus Linguistics, 14(2):159–190. DOI: https://doi.org/10.1075/ijcl.14.2.02dav

8. Hadjeres, G. and Nielsen, F. (2020). AnticipationRNN: Enforcing unary constraints in sequence generation, with application to interactive music generation. Neural Computing and Applications, 32(4):995–1005. DOI: https://doi.org/10.1007/s00521-018-3868-4

9. Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Shazeer, N., Hawthorne, C., Dai, A., Hoffman, M., and Eck, D. (2018). Music Transformer: Generating music with long-term structure. arXiv preprint arXiv:1809.04281.

10. Louie, R., Cohen, A., Huang, C.-Z. A., Terry, M., and Cai, C. J. (2020). Cococo: AI-steering tools for music novices co-creating with generative models. In 25th International Conference on Intelligent User Interfaces, Workshop on Human-AI Co-Creation with Generative Models.

11. Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of the Association for Computational Linguistics (System Demonstrations), pages 55–60. DOI: https://doi.org/10.3115/v1/P14-5010

12. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

13. Nunes, J. C., Ordanini, A., and Valsesia, F. (2015). The power of repetition: Repetitive lyrics in a song increase processing fluency and drive market success. Journal of Consumer Psychology, 25(2):187–199. DOI: https://doi.org/10.1016/j.jcps.2014.12.004

14. Oore, S., Simon, I., Dieleman, S., Eck, D., and Simonyan, K. (2020). This time with feeling: Learning expressive musical performance. Neural Computing and Applications, 32(4):955–967. DOI: https://doi.org/10.1007/s00521-018-3758-9

15. Pachet, F., Paris, S. C., Papadopoulos, A., and Roy, P. (2017). Sampling variations of sequences for structured music generation. In Proceedings of the International Society for Music Information Retrieval Conference, pages 167–173. DOI: https://doi.org/10.4324/9781315621364-19

16. Pachet, F., Roy, P., Barbieri, G., and Paris, S. C. (2011). Finite-length Markov processes with constraints. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 635–642.

17. Papadopoulos, A., Pachet, F., Roy, P., and Sakellariou, J. (2015). Exact sampling for regular and Markov constraints with belief propagation. In Proceedings of the International Conference on Principles and Practice of Constraint Programming, pages 341–350. Springer. DOI: https://doi.org/10.1007/978-3-319-23219-5_24

18. Papadopoulos, A., Roy, P., and Pachet, F. (2014). Avoiding plagiarism in Markov sequence generation. In Proceedings of the Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence, pages 2731–2737.

19. Perez, G. and Régin, J.-C. (2017). MDDs: Sampling and probability constraints. In International Conference on Principles and Practice of Constraint Programming, pages 226–242. Springer. DOI: https://doi.org/10.1007/978-3-319-66158-2_15

20. Schulze, W. and Van Der Merwe, B. (2011). Music generation with Markov models. IEEE MultiMedia, 18(3):78–85. DOI: https://doi.org/10.1109/MMUL.2010.44

21. Walker, W., Lamere, P., Kwok, P., Raj, B., Singh, R., Gouvea, E., Wolf, P., and Woelfel, J. (2004). Sphinx-4: A flexible open source framework for speech recognition. Technical report, Sun Microsystems, Inc.

22. Widmer, G. (2016). Getting closer to the essence of music: The Con Espressione manifesto. ACM Transactions on Intelligent Systems and Technology (TIST), 8(2):1–13. DOI: https://doi.org/10.1145/2899004