Music is full of long-range dependencies. Examples include themes that recur at regular or irregular intervals; motifs across pitch, rhythm, and harmonic progression; and lengths of segments and phrases. The repetitive nature of music contributes to the fluency with which a song is processed by the human mind and has been linked to the success with which music earns popularity and renown (Nunes et al., 2015).
Despite the significance of such long-range dependencies, nearly all computational models for algorithmic music composition exhibit significant limitations in their ability to model such dependencies. Resolving this discrepancy between the long-range dependencies in human-composed music and the lack thereof in computationally-composed music has been considered one of the foremost challenges for the next decade of music information research (Widmer, 2016). In short, the question of how to design a computational model for composing music demonstrating long-range dependencies remains an open and challenging problem for AI researchers.
Although several models of composition have been proposed, they generally fall into a few common types. Neural models constitute one popular approach to sequence generation. Long short-term memory (LSTM) models reliably produce unique and musically coherent sequences (Oore et al., 2020) but are known to suffer from decaying memory, which makes predicting musical events and patterns over longer timespans problematic (Hadjeres and Nielsen, 2020). The more recent advent of transformers has led to significant breakthroughs in generating music with clear long-range dependencies (Huang et al., 2018). The major challenge with neural approaches, however, has been their lack of steerability and explainability, attributes which for many applications of computer-generated music (including those with which we concern ourselves) remain essential and of primary interest (Dathathri et al., 2019; Louie et al., 2020).
Markovian models represent a second common type of music composition model. Markov models are a class of probabilistic sequence models that fundamentally assume the composition of a sequence depends strictly on the immediate context, without regard for what precedes that context (Schulze and Van Der Merwe, 2011). Recent work has focused on imposing steerable, long-range dependencies in Markov models through the incorporation of constraints. Non-homogeneous Markov models (also called constrained Markov models) have demonstrated the ability to enforce unary constraints during the generative process, allowing some degree of control in generating sequences (Pachet et al., 2011). Barbieri et al. (2012) demonstrate that by pre-defining sets of unary constraints to have matching constraint criteria, sequences can be generated that evoke a sense of long-range dependency. This method works well in one sense, but the necessity of pre-defining the matching constraint criteria severely limits the results that can be obtained from the model.
Others have focused on the specific case of imposing regular constraints in sequence generation. Factor graph models impose regular constraints by combining an automaton representing regular constraints with a Markov model (Papadopoulos et al., 2015). Multi-valued Decision Diagrams (MDDs) can be used to model a combination of Markov and regular constraints (Perez and Régin, 2017). To the extent that patterns of musical repetition can be represented as regular constraints, these models represent effective solutions.
Despite numerous efforts, no general methods have yet been found for imposing constraints to control generation in deep learning and neural architectures (Briot and Pachet, 2020; Hadjeres and Nielsen, 2020).
Besides constraint-driven approaches, another common approach to creating long-range repetitive and phrasal structure in music is using a form of “copy-paste”, generating musical subsequences that are then duplicated at intervals (Collins and Laney, 2017). This approach is fundamentally non-Markovian and commonly leads to unnatural transitions at splice sites. Pachet et al. (2017) present a similar “copy-paste” method for creating structured music lead sheets but use a controlled variation mechanism to ensure that copies are contextually situated to satisfy Markovian properties.
We present a novel solution to the long-range dependency problem. To our knowledge, this is the first method which has been demonstrated to generate compositions according to Markov probabilities—ensuring local cohesion throughout the sequence—with steerable, long-range, dynamic dependencies, allowing the imposition of widespread motifs and repetitions. We demonstrate how the model is capable of generating music that simultaneously exhibits a variety of patterns and motifs in pitch, rhythm, harmony, and lyrics.
Our approach involves the combination of nonunary constraints with a Markov model to allow the enforcement of long-range dependencies. Our presentation will be divided into two sections. In the first section, we demonstrate how nonunary constraints are effectuated within the context of the Markov window. Dependencies in such a limited context can hardly be considered long-range, but resolving how nonunary constraints are handled within the Markov window is a necessary complement to considering their implementation beyond the Markov window and provides opportunity to demonstrate on a smaller scale some powerful applications of nonunary constraints. In the second section, we demonstrate the implementation of nonunary constraints spanning arbitrary ranges beyond the Markov window.^{1}
Although our main purpose is ultimately to define a model capable of enforcing long-range dependencies, any consideration of constraints in the context of Markov models will require an understanding of how constraints behave in the short-range, namely within the range defined by the Markov window. As such, we treat the issue of short-range dependencies first. It is worth noting that the method for implementing such short-range dependencies—aside from being more computationally efficient than the method for implementing dependencies beyond the Markov window—is useful on its own as a model for enforcing nonunary constraints independent of whether long-range dependencies are also added.
The implementation and application of relational constraints (which we define below) has not yet been carefully examined in the context of Markov models. In our experience in implementing such constraints, we happened upon some unique applications of these constraints that we had not initially thought of that are worthy of mention. These are also discussed in this section.
In speaking of long- and short-range dependencies in a sequence of music, we can formally define a dependency between two (or more) points in terms of what in set theory are called relations. An n-ary relation ρ defined over sets Σ_{1}, …, Σ_{n} is a subset of the Cartesian product of Σ_{1} × … × Σ_{n}, or in other words a set of tuples (x_{1}, …, x_{n}) such that x_{i} ∈ Σ_{i} for i ∈ [1, n]. We will deal primarily with n-ary relations where n = 1 and n = 2, called unary and binary relations respectively. In general Σ_{1}, …, Σ_{n} can be different sets, but where we apply these relations to sequences of elements deriving from the same set, we assume relations over a single set Σ. Examples of unary relations might include the set {x | x is a note in a C major chord} or {x | x is a word rhyming with ‘today’}. Examples of binary relations might include the set {(x_{1},x_{2}) | x_{1} and x_{2} are notes of the same pitch} or {(x_{1},x_{2})| x_{1} is a word that rhymes with x_{2}}.
Given a sequence of random variables X = (X_{1}, …, X_{n}) we define a unary relational constraint or simply unary constraint (X_{i},ρ) as a constraint that is satisfied if and only if X_{i} = x_{i} and (x_{i}) ∈ ρ. We similarly define a binary relational constraint or simply binary constraint (X_{i},X_{j},ρ) as a constraint that is satisfied if and only if X_{i} = x_{i}, X_{j} = x_{j}, and (x_{i},x_{j}) ∈ ρ. We assume an analogous definition for an n-ary relational constraint. A set of constraints is satisfied if and only if all constraints in the set are satisfied. An example of a unary constraint is (X_{3},{x | x is a verb}).
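The relation and constraint definitions above can be made concrete with a short sketch. This is a minimal Python illustration, not the paper's implementation; the rhyme test is a deliberately crude suffix match and the sets are toy examples.

```python
# Illustrative sketch of unary and binary relational constraints over word
# tokens. The rhyme test is a crude 2-character suffix match (a stand-in for
# a real phonetic comparison).
def rhymes(x, y):
    """Hypothetical rhyme test: shared 2-character suffix, distinct words."""
    return x[-2:] == y[-2:] and x != y

# Unary constraint (X_i, rho): rho given in extension as a set.
rho_unary = {"do", "re", "mi"}            # {x | x is a solfege syllable}
def unary_satisfied(x_i):
    return x_i in rho_unary

# Binary constraint (X_i, X_j, rho): rho given in intension as a predicate.
def binary_satisfied(x_i, x_j):
    return rhymes(x_i, x_j)

assert unary_satisfied("do")
assert binary_satisfied("say", "day")
```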
A Markov model M = (Σ, P_i, P_T) consists of an alphabet of tokens Σ, a set of initial probabilities P_i: Σ → [0.0, 1.0] such that Σ_{x∈Σ} P_i(x) = 1.0, and a set of transition probabilities P_T: Σ × Σ → [0.0, 1.0] such that ∀x ∈ Σ, Σ_{y∈Σ} P_T(x, y) = 1.0. The probability according to M of a sequence s = (s_0, …, s_n) is

P_M(s) = P_i(s_0) · ∏_{i=1}^{n} P_T(s_{i−1}, s_i).
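This probability is the product of the initial probability of the first token and the transition probabilities of each successive pair of tokens. A minimal Python sketch, with P_i and P_T as illustrative toy dictionaries (not the paper's data):

```python
# Sketch: sequence probability under a first-order Markov model
# M = (Sigma, P_i, P_T), with P_i and P_T as nested dicts (toy values).
P_i = {"C": 0.5, "G": 0.5}
P_T = {"C": {"C": 0.2, "G": 0.8},
       "G": {"C": 0.6, "G": 0.4}}

def sequence_probability(s):
    """P_M(s) = P_i(s_0) * product of P_T(s_{i-1}, s_i) over the sequence."""
    p = P_i.get(s[0], 0.0)
    for prev, cur in zip(s, s[1:]):
        p *= P_T.get(prev, {}).get(cur, 0.0)
    return p

assert sequence_probability(["C", "G", "C"]) == 0.5 * 0.8 * 0.6
```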
As defined by Pachet et al. (2011), a finite-length non-homogeneous Markov model (NHMM) $\tilde{M}$ is created given a finite length value l, a trained Markov model M = (Σ, P_i, P_T), and a set C of unary constraints. The value of l determines the length of the sequence X = (X_1, …, X_l) that $\tilde{M}$ models. We can partition C into subsets C_1, …, C_l such that subset C_i represents constraints applying to X_i. $\tilde{M}$ has the following properties:
$\tilde{M}$ is constructed by defining transition matrices M_1, …, M_{l−1} where each matrix M_i is initialized as a copy of the transition matrix P_T of M. Each M_i is then modified according to the constraints in C_i. The constraint satisfaction problem is made arc-consistent,^{2} and probabilities are normalized to ensure solutions are assigned the same relative proportion of the probability density (see Pachet et al. (2011) for details on enforcing arc-consistency and normalization).
The properties, construction, and normalization of $\tilde{M}$ are the same regardless of the Markov order of M. M has traditionally been presented as a single-order Markov model M_1 which generates a sequence according to the probability function P(X_1, …, X_n) = P(X_1) · P(X_2|X_1) · … · P(X_n|X_{n−1}). However, M may just as easily be a d-order Markov model, M_d, which generates a sequence with a longer memory such that P(X_i|X_1, …, X_{i−1}) = P(X_i|X_{i−d}, …, X_{i−1}). The fact that the order of M does not affect the NHMM construction falls out immediately from the observation that M_d can be created using a single-order Markov model on an alphabet of d-grams (Papadopoulos et al., 2015). More precisely this means changing the finite state space alphabet Σ of M_1 (where each element a_i ∈ Σ is a single element) to a state space Σ_d = {A_1, …, A_m} where each element A_j ∈ Σ_d represents a d-tuple of elements from Σ.
In M_{d} a transition from tuple A_{j} to tuple A_{k} exists if and only if the (d–1)-length suffix of A_{j} and the (d–1)-length prefix of A_{k} are identical. Thus when probabilistically sampling a sequence from M_{d} or ${\tilde{M}}_{d}$ , all unary tokens in the first sampled sequence-state and the final token of each subsequently sampled sequence-state are concatenated to form the generated token sequence (e.g., see Figures 1 and 2). Thus a d-order NHMM of length l generates sequences of length (d + l – 1).
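The concatenation of sampled d-tuple states into a token sequence can be sketched as follows (illustrative Python; the tuple values are invented):

```python
# Sketch: forming a token sequence from sampled d-tuple states. All tokens of
# the first state are emitted, then the final token of each subsequent state
# (valid because consecutive states overlap on a (d-1)-length suffix/prefix).
def concatenate_states(states):
    tokens = list(states[0])
    for state in states[1:]:
        tokens.append(state[-1])
    return tokens

# Three 3-tuple states of a 3rd-order model yield a sequence of length
# d + l - 1 = 3 + 3 - 1 = 5.
states = [("down", "by", "the"), ("by", "the", "bay"), ("the", "bay", "where")]
assert concatenate_states(states) == ["down", "by", "the", "bay", "where"]
```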
The NHMM model was originally defined to enforce unary constraints. With a d-order NHMM model, each state in the model represents a d-tuple. In this formulation, short-range dependencies can already be enforced if the range of the dependency is less than or equal to d. For example, when generating a 6-length sequence of words with a 3rd-order NHMM, we can enforce a rhyming constraint between the 4th and 6th words (a range of 3) by zeroing out transitions in M_{3} to 3-tuples in which the first and last words do not rhyme (see Figure 2). Thus the Markov order defines a range within which short-term dependencies can be modeled using unary constraints in the traditional NHMM model. Because the NHMM model can be constructed in polynomial time, this represents an efficient method for modeling short-range dependencies.
In a traditional d-order NHMM, constraints are unary and thus can only be applied to prune d-tuples at one state position, X_i. However, with only a slight adjustment, a d-order NHMM can be constructed to allow binary constraints applied to d-tuples at two adjacent positions, X_{i−1} and X_i. To implement this adjustment, consider that enforcing a unary constraint (X_i, ρ) in an NHMM involves pruning the transition matrix M_{i−1}. However, M_{i−1} represents transitions from X_{i−1} to X_i, and thus we can choose to zero out transition probabilities based on constraint criteria that depend on both X_{i−1} and X_i. If a particular (x_{i−1}, x_i) pair does not satisfy a binary constraint (X_{i−1}, X_i, ρ), we zero the transition from x_{i−1} to x_i in M_{i−1}. Thus without increasing the polynomial runtime required for NHMM construction, we can extend the range of short-range dependencies from the Markov order (d) to d + 1. Considering that Markov orders are commonly less than 5, this effectively increases the range of enforceable short-range dependencies in NHMMs by 20% or more.
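The pruning step described above can be sketched as follows. This is an illustrative Python fragment, not the authors' implementation: the transition matrix is a sparse dictionary over d-tuple state pairs, the rhyme predicate is a crude suffix match, and the renormalization performed in the NHMM construction is omitted.

```python
# Sketch: enforcing a binary constraint (X_{i-1}, X_i, rho) by zeroing entries
# of the transition matrix M_{i-1} between d-tuple states. `rho` is any
# predicate over the (state_from, state_to) pair.
def prune_matrix(matrix, rho):
    """matrix: dict mapping (state_from, state_to) -> probability."""
    return {(a, b): (p if rho(a, b) else 0.0)
            for (a, b), p in matrix.items()}

# Illustrative rho: first token of the source tuple must rhyme (crude suffix
# match) with the last token of the target tuple.
rho = lambda a, b: a[0][-2:] == b[-1][-2:]
M = {(("cat", "on", "a"), ("on", "a", "mat")): 0.7,
     (("cat", "on", "a"), ("on", "a", "rug")): 0.3}
pruned = prune_matrix(M, rho)
assert pruned[(("cat", "on", "a"), ("on", "a", "mat"))] == 0.7
assert pruned[(("cat", "on", "a"), ("on", "a", "rug"))] == 0.0
```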
See endnote 1 for a link to our source code for building d-order NHMMs with binary constraints.
There are many different types of dependencies that can be represented as relational constraints in higher-order NHMMs. We will look at two examples here. The same types of dependencies can also be imposed as long-range dependencies, as we demonstrate subsequently, but the narrower scope of short-range dependencies provides a better context for an introduction. To facilitate ease of understanding we will use natural language to demonstrate the examples, but they apply just as powerfully to sequences of pitches, rhythms, chords, or other musical viewpoints. When referencing parts of speech (POS), we use the POS tags from the Penn Treebank Project.^{3} For all POS labeling we use the Stanford CoreNLP Toolkit (Manning et al., 2014).
As mentioned in the introduction, one way of evoking the impression that two positions of a composition are related (e.g., in a repeated motif) is to impose identical or related unary constraints at each position. For example, Barbieri et al. (2012) use an NHMM to generate lyrics such that last word variables in the first and second lines are respectively constrained to be the word “today” and to rhyme with “today”. Note that once the constraints have been defined, the variables representing the words at these two positions are completely independent. We call this type of constraint a static constraint because all of the criteria needed to make the assignment to one variable independent of the other is statically defined before the model is built.
Using static constraints to create (the illusion of) dependencies comes with huge limitations. As an example, consider that the rhyme group to which “today” belongs is only one of thousands of possible rhyme groups, meaning the range of viable solutions that are accessible to the model is reduced by several orders of magnitude. In the case of music where we might constrain pitches, rhythms, or chords to match in order to create a motif, static constraints have a similarly reducing effect on the solution space, with the exact reduction being equal to the size of the pitch, rhythm, and chord domains. This means an 88× reduction to model a pitch dependency with static constraints (if we limit ourselves to the range of pitches on a standard keyboard); a 15× reduction to model a rhythm dependency (if our shortest duration is a 16th note, our longest a whole note, and allowing double-dotted durations); and a 4,000× or more reduction to model chords.^{4} This problem grows exponentially with each additional constraint set needed to construct the motif or other motifs.
Relational constraints in higher-order NHMMs enable a class of dependency we call dynamic dependencies. A dynamic dependency is a true dependency in that the assignment to one variable cannot be made independently of another. That is, the precise criteria by which a dynamic dependency is satisfied at a particular sequence position are determined at runtime as a function of assignments at other positions. Using a dynamic constraint, the system determines dynamically whether a particular pair of words belong to the same rhyme group without needing to statically define the rhyme group of either word ahead of time. Dynamic dependencies recover all of the expressive power lost through the use of static constraints to create dependencies.
A dynamic dependency is achieved in d-order NHMMs by testing whether or not the d + 1 length sequence deriving from assignments to two adjacent variables X_{i} and X_{i+}_{1} represents a subsequence in which two elements of interest satisfy the desired property (e.g., rhyme). Without additional modifications to the NHMM, this requires that the dynamically constrained positions must be within d + 1 positions of each other, thereby maintaining the Markov property (see Figure 2). This makes dynamic constraints particularly well-suited for genres like rap where rhymes are often closely situated.
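A minimal sketch of such a dynamic rhyme test over two adjacent tuple variables (illustrative Python; the suffix-match rhyme test is a stand-in for a real phonetic comparison):

```python
# Sketch: a dynamic rhyme dependency between positions p and q (within d+1 of
# each other), tested on the (d+1)-length subsequence formed by assignments to
# adjacent tuple variables X_i and X_{i+1}.
def dynamic_rhyme(state_a, state_b, p, q):
    # adjacent d-tuples overlap on d-1 tokens, so together they span d+1 tokens
    window = list(state_a) + [state_b[-1]]
    return window[p][-2:] == window[q][-2:]   # crude rhyme: shared suffix

# 3rd-order example: rhyme between the 1st and 4th tokens of the window.
assert dynamic_rhyme(("bay", "where", "the"), ("where", "the", "say"), 0, 3)
```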
In many applications the dynamic constraints of interest relate elements at very distant positions. Such dynamic constraints are not well-suited to the approach we have outlined thus far: as the Markov order d increases, fewer satisfying solutions can be generated, and solutions tend towards replicating exact phrases from the training corpus. Papadopoulos et al. (2014) demonstrate a regular-constraint solution that prevents sampling exact phrases above a certain length from the training data. Regular constraints are also a viable alternative for generating music with long-range dependencies, which we demonstrate in Section 3.
The examples in Section 2.4 use values of d in the range 4 ≤ d ≤ 8. In these cases we have elected to retain exact phrases both for the sake of demonstration and because there remains some possibly important aesthetic value in repurposing existing artifacts in novel problem domains (e.g., an objet trouvé). In fields such as computational creativity these solutions represent a significant contribution in their own right (Colton et al., 2010).
Often when composing a sequence with dependencies, it is desirable to allow some flexibility in precisely where the dependency is satisfied. We call this type of dependency a floating dependency. For example, rather than constrain the ith position of a phrase to be a particular POS (e.g., a noun), we can constrain the model such that at least one token in the range of positions [i − d, i] must be a particular POS. Using a d-order Markov model, this dependency can be enforced using the binary constraint (X_{i−d}, X_{i−d+1}, ρ) where ρ = {(x, y) | the dependency is satisfied by at least one element in the d-tuples x or y}. Floating dependencies are useful for granting the model greater flexibility in generation.
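A floating POS dependency of this form can be sketched as a predicate over two adjacent d-tuples (illustrative Python; tokens are assumed, for illustration only, to carry their POS tags as (word, tag) pairs):

```python
# Sketch: a floating POS dependency, satisfied if at least one token in the
# (d+1)-length window covered by adjacent tuples x and y has the desired POS.
def floating_pos(x, y, pos="NN"):
    window = list(x) + [y[-1]]                # adjacent d-tuples overlap
    return any(tag == pos for _, tag in window)

x = (("jumping", "VBG"), ("on", "IN"), ("the", "DT"))
y = (("on", "IN"), ("the", "DT"), ("bed", "NN"))
assert floating_pos(x, y, pos="NN")           # "bed" satisfies the constraint
assert not floating_pos(x, y, pos="JJ")       # no adjective in the window
```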
An example of the application of floating dependencies is in defining semantic constraints. Barbieri et al. (2012) use an NHMM to generate a poem related to the concept of “music”. To do so they define unary constraints at random, fixed positions to impose the corresponding word be semantically related to “music” in addition to imposing other rhythm and POS constraints. By their own admission, the approach has limited flexibility because of having to define unary constraints at specific positions. Considering that theirs was a 2nd-order NHMM, we might reasonably expect that reformulation of their constraint as a short-range floating dependency, allowing the semantically-relevant word to be generated at one of 3 positions, would triple the number of “music” poems that the model could compose.
Floating dependencies can also be used to avoid over-restricting syllable-level NHMMs when imposing word-level constraints. Using a Markov process over words, Barbieri et al. (2012) are able to constrain generation according to word-level POS templates. For example, the word-level POS template for the phrase “Yesterday, love was such an easy game to play” is denoted [NN, NN, VBD, JJ, DT, JJ, NN, TO, VB]. Although we can impose the same constraints using a syllable-level POS template (e.g., [NN, NN, NN, NN, VBD, JJ, DT, JJ, JJ, NN, TO, VB]), doing so reduces the expressiveness of our syllable-level model by up to a factor of 2^{l−1} for a sequence of length l insofar as we limit the model’s ability to determine the number of syllables per word and words per phrase. One solution is to place POS constraints at only a subset of positions (e.g., [NN, , , NN, VBD, JJ, DT, JJ, , NN, TO, VB]), allowing the model to fill in the gaps. This solution, however, still severely restricts the model’s expressiveness. It also requires statically defining the sequence of POS tags allowed at each position, essentially limiting the number of allowable POS sequences to one.
Floating dependencies solve this problem by providing a means of imposing word-level POS templates without sacrificing expressiveness and with the ability to allow multiple valid POS sequences. In a syllable-level d-order NHMM, each transition matrix M_i essentially encodes all possible (d + 1)-length syllable sequences that could result from assignments to the variables X_i and X_{i+1}. Associated with each syllable is the POS of the word from which the syllable was derived, enabling extraction of a (d + 1)-length syllable-level POS sequence (e.g., the syllable-level POS sequence for the syllable sequence “lla·ma wear·ing pol·ka-dot pa·ja·mas” would be [NN, NN, VBG, VBG, NN, NN, NN, NNS, NNS, NNS]). Implemented as a floating dependency (X_i, X_{i+1}, ρ), each such (d + 1)-length syllable-level POS sequence can be tested against a word-level POS template (e.g., ρ = {(x, y) | the syllable-level POS sequence represented by the tuples x and y must equal [NN, VBG, NN(S)] after condensing consecutive repeated tags}). In this way word-level POS constraints can be imposed in syllable-level NHMMs without specifying a precise position where they must be satisfied. Furthermore, ρ can be made to allow multiple POS templates (see Figure 3). Syllable-level sequences can thus be validated using many different word-level lexical analyses (e.g., noun phrase classifiers).
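The tag-condensing step can be sketched as follows (illustrative Python, using the syllable-level tag sequence from the example above):

```python
# Sketch: matching a syllable-level POS sequence against word-level POS
# templates by condensing consecutive repeated tags.
def condense(tags):
    out = []
    for t in tags:
        if not out or out[-1] != t:
            out.append(t)
    return out

# Syllable-level tags for "lla-ma wear-ing pol-ka-dot pa-ja-mas".
syllable_tags = ["NN", "NN", "VBG", "VBG", "NN", "NN", "NN",
                 "NNS", "NNS", "NNS"]
assert condense(syllable_tags) == ["NN", "VBG", "NN", "NNS"]

# rho can allow multiple word-level templates (hypothetical examples).
templates = [["NN", "VBG", "NN", "NNS"], ["NN", "VBD", "NN"]]
assert condense(syllable_tags) in templates
```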
In each of the three following examples we demonstrate a unique advantage afforded by dynamic and/or floating dependencies in NHMMs.
We demonstrate the use of dynamic dependencies in the context of a lyric-generation problem which we call the Down by the Bay (DBTB) problem. “Down by the Bay” is a traditional song requiring improvisation of novel lyrics which conform with rhythmic, rhyming, and POS patterns representative of the song’s characteristic structure.
After careful consideration of several prototypical solutions to the DBTB problem, we found that solutions generally followed one of 5 rhythmic templates (i.e., sequences of stressed [1] and unstressed [0] syllables): [011001], [0110101], [01010010], [01101011], or [01010101010]. Because there are variable satisfactory lengths and the length l for an NHMM must be fixed a priori, we create a syllable-level d-order NHMM for each rhythmic template r = (r_1, …, r_n)—where d equals one less than the length of the maximal subsequence of r that starts and ends with 1, i.e., 4 ≤ d ≤ 8—with length l = n − d + 1 (e.g., see Table 1). For each model we extracted a set of constraints on the sequence X = (X_1, …, X_l) to be generated for solving the DBTB problem as follows:
|                          | NHMM 1                     | NHMM 2                     | NHMM 3                        | NHMM 4                           | NHMM 5          |
|--------------------------|----------------------------|----------------------------|-------------------------------|----------------------------------|-----------------|
| NHMM Order (d)           | 4                          | 5                          | 5                             | 6                                | 8               |
| NHMM Length (l)          | 3                          | 3                          | 4                             | 3                                | 4               |
| Syllable Seq. Length (n) | 6                          | 7                          | 8                             | 8                                | 11              |
| Rhythmic Template (r)    | [011001]                   | [0110101]                  | [01010010]                    | [01101011]                       | [01010101010]   |
| Training Sentences       | 3,892,039                  | 2,654,884                  | 2,654,884                     | 2,051,040                        | 1,239,850       |
| Training Pronunciations  | 366,062,046                | 286,075,704                | 286,075,704                   | 255,086,072                      | 208,330,754     |
| Solutions Generated      | 30                         | 5                          | 5                             | 4                                | Not Satisfiable |
| Generated Example        | “a dish of pickled fish”   | “a cot and a chamber pot”  | “a pillar that was a mirror”  | “a mouse or a rat in the house”  | n/a             |
| Average Novelty          | 3.65                       | 3.66                       | 4.13                          | 3.40                             | n/a             |
| Average Rhyme            | 4.18                       | 4.21                       | 1.94                          | 4.00                             | n/a             |
| Average Rhythm           | 3.12                       | 3.30                       | 2.13                          | 3.16                             | n/a             |
| Average Amusement        | 2.53                       | 2.39                       | 2.06                          | 2.48                             | n/a             |
| Average Likability       | 2.51                       | 2.66                       | 2.17                          | 2.82                             | n/a             |
Note that dependencies 4 and 5 are examples of dynamic dependencies spanning the rhyming positions. Dependency 6 is an example of a floating dependency that uses a syntax tree with POS templates derived from traditional DBTB solutions (see Figure 3).
As training data for our DBTB models we used the Corpus of Contemporary American English (COCA) (Davies, 2009). To improve parsing we selected only complete phrases (as delimited by any non-alphabetic or apostrophe characters) which had a maximum of 30 space-delimited word tokens. Sentence pronunciation was inferred using word pronunciations from the CMU dictionary^{5} and the CMUSphinx grapheme-to-phoneme converter (Walker et al., 2004). Models were trained on each unique pronunciation of the sentence, with each pronunciation being down-weighted in the model proportional to the number of pronunciations for the sentence.
We trained five NHMMs, one for each of the rhythmic templates identified above. All but one of the five trained models found satisfying solutions (see Table 1). Each model was trained on sentences with syllable sequence length l_{s} ∈ [d,30]. The system, implemented in Java 1.8, trained with 24 cores and 256 GB of RAM for 5 hours and 43 minutes.
To evaluate the effectiveness of higher-order NHMMs, we performed a quantitative assessment of subjective reactions to the DBTB solutions generated by our models. We obtained 62 human-generated solutions by inviting university faculty and students to “come up with your own novel ending”. Of these we sampled a random subset to obtain an equal number of computer-generated and human-generated solutions (44 of each). We conducted 94 surveys in which participants were asked to rate five DBTB solutions (randomly and evenly sampled from our mixed pool) on Likert scales for novelty, rhyme, rhythm, amusement, and overall likability (each solution thus being rated an average of 5.34 times). Results are shown in Figure 4 and Table 1.
The top five highest-scoring computer-generated solutions, their scores (averaged over the 5 criteria), and the NHMM that created them were:
The five median-scoring computer-generated solutions were:
The five lowest-scoring computer-generated solutions were:
We found that the novelty scores of computer-generated solutions were competitive with those of solutions generated by humans. The rhyme scores were also very competitive, with the difference being partially explained by our failure to consider the matching of syllable onsets in multisyllabic words (e.g., “pillar”, “mirror”) in NHMM 3. We hypothesize that the difference observed in the rhythm score may be due to the confounding variables of rhyme and grammaticality. For example, the poor rhyming performance of NHMM 3 (the only model with multisyllabic rhymes) may have carried over into the ratings for other aspects of NHMM 3 solutions. Likewise it seems a reasonable hypothesis that, although poor grammar does not affect rhythm, it may nonetheless be perceived to affect the rhythm; however, we did not include grammaticality in our assessment. The amusement and likability scores, with which the computer struggled most, may have been affected by the genre-appropriateness of some of the computer-generated solutions (e.g., “a flood of spurting blood”, “an orphan and an abortion”).
To demonstrate the generality of floating dependencies, we designed two different syllable-level NHMMs to generate haiku with floating semantic dependencies. The first was a 5th-order NHMM of length 13 with a nature-themed floating semantic constraint applied over the first 5 syllables. The second was a 4th-order NHMM of length 14. In this model we applied a floating word-level POS template constraint (similar to those described for the DBTB models) over the last 5 syllables of each line. Each line-specific constraint had a syntax tree populated with word-level POS templates parsed from corresponding lines in a database of existing haiku.
We imposed a beauty/earth-themed floating semantic constraint over the last 4 syllables using word2vec (Mikolov et al., 2013) to ascertain semantic relatedness between words. In both models, each line was constrained to start and end with word-starting and word-ending syllables. Figure 5 shows an example of manually selected haiku generated from each model. The haiku from the first model was chosen from among 7 randomly generated haiku (all 7 represented exact sequences from the training data). The haiku from the second model was chosen from among 14 randomly generated haiku (only 1 of the 14 from the second model was an exact sequence from the training data). We again note that solutions exist to prevent plagiarism of exact sequences from the training data (Papadopoulos et al., 2014), but we did not employ such methods in these models.
For generating prosodic rhythm, we designed a 4th-order NHMM over rhythm tokens of length 6 to generate rhythms for the lyrics “No more monkeys jumping on the bed”. We trained the model on lyric/rhythm sequences from the Wikifonia dataset. Given the rhythmic stress template of the lyrics, [111010101], the model imposes floating stress constraints requiring only two thirds of syllable stresses to match appropriately to emphasized rhythmic positions (one constraint over the first five syllables and another over the last four). The model has 4/4 floating time signature constraints at all positions (see Figure 6).
In the first example in Figure 6, a rhythm is generated which places all six stressed syllables on emphasized rhythmic positions. However, in the second example, only four of the six stressed syllables align with emphasized rhythmic positions (the minimum required to satisfy the floating constraint). As a result, two stressed syllables fall on offbeats, creating a syncopated effect with accents in unstressed bar positions. This example demonstrates that not only can the model generate rhythms to complement arbitrary lyrical sequences, but that it is capable of meeting constraints while being given some creative autonomy to break rules or constraints to some controlled degree. In a similar fashion, floating constraints might be used to allow autonomous music generation systems to create lyrical, melodic, harmonic, or rhythmic variations on a motif, enabling generation of songs that adhere to broad forms of structural repetition but with nuances at each repetition.
Most motifs and dependencies in music occur across ranges far longer than any reasonable Markov order. To address this need we demonstrate a new method for modeling long-range dependencies with NHMMs. To our knowledge, this is the first method which has been demonstrated to generate compositions according to Markov probabilities, ensuring local cohesion throughout the sequence, with long-range, dynamic dependencies, allowing the imposition of widespread motifs and repetitions. This approach places no limit on the range of dependencies, and the Markov order can be chosen without impact on the ability to impose long-range dependencies. We demonstrate the abilities of this model by generating sheet music compositions with verse-chorus structures, including repeating pitch, rhythm, harmonic, and lyric motifs.
An important key to generating music with long-range dependencies is the recognition that n-ary relational constraints can, in fact, be modeled using regular constraints. This may at first appear to be at odds with computational theory. Consider, for example, that the set of all palindromes—a quintessential example of a relational constraint—is not a regular language. However—and this is the key—the set of all palindromes of a particular, finite length is a regular language. Thus given a finite length of the composition to be generated, we can model long-range dependencies using regular constraints. As any regular language can be recognized/generated by a deterministic finite automaton (DFA), we take the approach of creating a DFA that incorporates all long-range dependencies and then combine this DFA with an NHMM in order to ensure that sequences recognized by the DFA are sampled with the correct probability.
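The observation that fixed-length palindromes form a finite, hence regular, language can be made concrete by enumeration (illustrative Python; exponential in the length and intended only to make the point, since a finite set of strings is always accepted by some trie-shaped DFA):

```python
# Sketch: the set of palindromes over a finite alphabet with a fixed length n
# is finite, and every finite language is regular.
from itertools import product

def fixed_length_palindromes(alphabet, n):
    return {"".join(w) for w in product(alphabet, repeat=n)
            if list(w) == list(reversed(w))}

pals = fixed_length_palindromes("ab", 3)
assert pals == {"aaa", "aba", "bab", "bbb"}
```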
The exposition of the method for imposing long-range dependencies relies on the definitions of relational constraints, Markov models, and NHMMs provided in Section 2.1. In addition, we define a DFA as a quintuple A = ⟨Q, Σ, δ, q_{0}, F⟩ in which Q is a finite set of states; Σ is a finite alphabet; δ : Q × Σ → Q is the transition function; q_{0} ∈ Q is the start state; and F ⊆ Q is the set of accepting states.
A sequence s = (s_{1}, …, s_{n}) is accepted by A iff there exists a sequence q_{0}, …, q_{n} of states such that ∀i ∈ [1,n], δ(q_{i−1}, s_{i}) = q_{i} and q_{n} ∈ F. The language L(A) is the set of all sequences which A accepts.
Note that when combining a DFA and Markov model, the domain Σ for variables X_{1}, …, X_{n} is the same as the alphabet Σ for the DFA.
Given a set of binary relational constraints C = {(X_{i},X_{j},ρ)}, a set of valid start symbols I ⊆ Σ, a set of valid transitions T ⊆ {s_{i}s_{j} | s_{i} ∈ Σ and s_{j} ∈ Σ}, and a length n, Algorithm 1 creates a RELATIONAL automaton, i.e., a DFA A such that L(A) is the set of all sequences s ∈ Σ^{n} such that s_{1} ∈ I; s_{i−1}s_{i} ∈ T for i ∈ [2,n]; and (s_{j},s_{k}) ∈ ρ for all relational constraints (X_{j},X_{k},ρ) in C. Where I and T are determined from the non-zero entries of P_{I} and P_{T} of a Markov model M, this means P_{M}(s) > 0 for every s ∈ L(A).
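The membership conditions on L(A) can also be written directly as a predicate, which is useful as an oracle when testing an automaton produced by Algorithm 1. A minimal Python sketch (our own encoding, not the authors' code; constraints are (j, k, rel) triples with 1-based positions and rel a predicate over pairs):

```python
def in_language(s, I, T, C):
    """Check the three conditions defining L(A): valid start symbol,
    valid consecutive transitions, and all relational constraints."""
    if not s or s[0] not in I:
        return False
    if any((s[i - 1], s[i]) not in T for i in range(1, len(s))):
        return False
    # each constraint (j, k, rel) requires rel(s_j, s_k), 1-based
    return all(rel(s[j - 1], s[k - 1]) for (j, k, rel) in C)

eq = lambda x, y: x == y
T = {('A', 'B'), ('B', 'A'), ('B', 'B')}
assert in_language(['A', 'B', 'A'], {'A'}, T, [(1, 3, eq)])
assert not in_language(['A', 'B', 'B'], {'A'}, T, [(1, 3, eq)])
```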
The algorithm takes the general approach of building breadthwise a tree-like DFA where each layer (of edges) in the tree represents a position in the sequence to be generated. Each state q_{i} in the DFA is associated with one of the variables X_{j} and is defined by a triple ⟨C_{qi},V_{qi},q_{i}⟩ where C_{qi} is a set of relational constraints representing dependencies whose range includes position j; V_{qi} ⊆ Σ is the set of valid edge labels allowed by T for edges leaving the state; and q_{i} is a reference to the state itself. Whereas a binary constraint in C relates two variables, a constraint (a,X_{k},ρ) in C_{qi} relates an element a ∈ Σ to a variable X_{k}.
When expanding the DFA from a state ⟨C_{qi},V_{qi},q_{i}⟩ associated with X_{j}, a new path is considered for every label v ∈ V_{qi}. In order to build the path, the algorithm first computes what would be the new triple ⟨C_{q}′,V_{q}′,q′⟩ for the new state reached via v in order to check whether a state with the same constraint and edge-label sets already exists. The new set C_{q}′ inherits all binary constraints from C_{qi} except those relating an element a ∈ Σ to the variable X_{j}. For each constraint (X_{j},X_{k},ρ) ∈ C, we also add a constraint (v,X_{k},ρ) to C_{q}′. The new set V_{q}′ is populated with the set of all labels v_{2} such that 1) v → v_{2} is a valid transition according to T and 2) (a,v_{2}) ∈ ρ for every constraint (a,X_{j+1},ρ) in C_{q}′. In practice, V_{q}′ can be further constrained by unary constraints of the form (X_{j+1},ρ) to further prune undesirable sequences. Assuming V_{q}′ is not empty (an empty set signals a dead state), we check whether a state associated with sequence position j + 1 with the same constraint set C_{q}′ and edge label set V_{q}′ already exists before adding q′. In the last state layer there is a single accepting state to which all states in the previous state layer (i.e., those states associated with the last variable, X_{n}) connect via all valid labels in their respective edge label sets.
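The construction just described can be condensed into the following Python sketch. This is our reading of Algorithm 1, not the authors' implementation: the state encoding (position, pending constraints, valid edge labels) and all names are illustrative, and relations are plain predicates.

```python
def relational_automaton(C, I, T, Sigma, n):
    """Build a layered DFA whose accepted sequences are exactly those
    s in Sigma^n with s[0] in I, consecutive pairs in T, and every
    binary constraint (i, j, rel) in C satisfied (1-based positions).
    States at the same layer are merged whenever their pending-constraint
    and edge-label sets coincide."""
    delta, accepting = {}, set()
    start = (1, frozenset(), frozenset(I))
    frontier = {start}
    for j in range(1, n + 1):
        next_frontier = set()
        for state in frontier:
            _, pending, labels = state
            for v in labels:
                # drop constraints resolved at position j; instantiate
                # constraints anchored at j with the chosen label v
                new_pending = frozenset(
                    {(a, k, rel) for (a, k, rel) in pending if k != j} |
                    {(v, k, rel) for (i, k, rel) in C if i == j})
                if j == n:
                    # single accepting state closes the last layer
                    delta[(state, v)] = 'accept'
                    accepting.add('accept')
                    continue
                # labels valid at j+1: allowed by T and by every pending
                # constraint anchored at position j+1
                next_labels = frozenset(
                    v2 for v2 in Sigma if (v, v2) in T
                    and all(rel(a, v2) for (a, k, rel) in new_pending
                            if k == j + 1))
                if not next_labels:
                    continue  # dead state: no satisfiable continuation
                nxt = (j + 1, new_pending, next_labels)
                delta[(state, v)] = nxt
                next_frontier.add(nxt)
        frontier = next_frontier
    return start, delta, accepting

# toy run: length 3, first and last symbols must match
eq = lambda x, y: x == y
Sigma = {'A', 'B'}
T = {(a, b) for a in Sigma for b in Sigma}
start, delta, accepting = relational_automaton([(1, 3, eq)], Sigma, T, Sigma, 3)

def run(seq):
    q = start
    for s in seq:
        if (q, s) not in delta:
            return False
        q = delta[(q, s)]
    return q in accepting

assert run(('A', 'B', 'A')) and not run(('A', 'B', 'B'))
```

Note how both middle-layer branches lead to the same merged state once the constraint anchored at position 3 is the only pending information, mirroring the reconvergence of divergent paths discussed below.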
Figure 7 shows the DFA built using Algorithm 1 with n = 4; C = {(X_{1},X_{4},ρ_{rhyme})} (where ρ_{rhyme} represents the set of rhyming word pairs); I = {Mary, Clay}; and T derived from the non-zero transitions represented in the Markov model shown in Figure 8.
The time and space requirements for the algorithm vary significantly depending on the inputs. Each constraint in C at a new position splits the path into a number of paths dependent on the number and distribution of transitions in T and the restrictiveness of the constraint’s relation ρ. Divergent paths reconverge once all overlapping constraints have been resolved. The number of paths and states can also be reduced by adding unary constraints (including negated relational constraints), increasing the number of symbols per label (i.e., increasing the Markov order), or reducing the size of the transition set T.
One way to optimize the construction of a RELATIONAL automaton is to take advantage of binary relational constraints for which ρ is an equivalence relation (i.e., reflexive, symmetric, and transitive). In such cases many partially instantiated constraints become equivalent. For example, under the equivalence relation of rhyming (and allowing for a word to rhyme with itself), the following partially instantiated binary relational constraints are equivalent: (Mary,X_{4},ρ), (Fairy,X_{4},ρ), (Carey,X_{4},ρ). These constraints can all be represented using a generalized constraint that uses the equivalence class E_{Mary} representing the set of all words that rhyme with “Mary”: (E_{Mary},X_{4},ρ). This facilitates the combining of states in the DFA.
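A minimal sketch of this generalization, assuming a precomputed map from words to rhyme classes (the map and all names are illustrative, not from the paper's code):

```python
def generalize(pending, class_of):
    """Replace the instantiating element of each partially instantiated
    constraint (element, position, relation) with its equivalence class,
    so that states whose constraints differ only within a class become
    identical and can be merged."""
    return frozenset((class_of[a], k, rel) for (a, k, rel) in pending)

# hypothetical rhyme classes
rhyme_class = {'Mary': 'E_Mary', 'Fairy': 'E_Mary', 'Carey': 'E_Mary'}
c1 = generalize({('Mary', 4, 'rhyme')}, rhyme_class)
c2 = generalize({('Fairy', 4, 'rhyme')}, rhyme_class)
assert c1 == c2  # equivalent constraints; the two DFA states can merge
```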
Given a RELATIONAL automaton A, a Markov model M, a set of any additional unary constraints U, and a length n, we can construct an NHMM which samples sequences from the target distribution P* defined as:

P*(X_{1}, …, X_{n}) = (1/Z) · 𝟙_{L(A)}(X_{1}, …, X_{n}) · P_{I}(X_{1}) ∏_{i=2}^{n} P_{T}(X_{i} | X_{i−1}) · ∏_{i=1}^{n} P_{U}(X_{i})   (with Z a normalizing constant)
where for sequences (X_{1}, …, X_{n}) which fail to satisfy U, the term P_{U}(X_{i}) = 0 for some i ∈ [1,n], resulting in P*(X_{1}, …, X_{n}) = 0. It is more efficient to apply the constraints in U when creating A because doing so further limits the size of the tree. We choose not to demonstrate this here for three reasons. First, the primary goal in presenting the RELATIONAL automaton is its ability to enforce non-unary constraints, and our aim is to keep the explication of its construction as clear and simple as possible. Second, the incorporation of unary constraints into Algorithm 1 can be intuited relatively easily: for each unary constraint (X_{j},ρ), prune non-satisfying elements from the edge label set V_{qi} for states q_{i} associated with X_{j}. Third, the construction of an NHMM presented by Pachet et al. (2011) by default assumes as input a set of unary constraints, giving us an already-defined method for dealing with these types of constraints.
As automata lack probabilities, we use the RELATIONAL automaton A to construct an NHMM as follows. Given an automaton A = ⟨Q, Σ, δ, q_{0}, F⟩, a Markov model M = (Σ, P_{I}, P_{T}) (the same as that used to construct A), a set of unary constraints U (partitionable into subsets U_{1}, …, U_{n} such that subset U_{i} represents unary constraints applying to X_{i}), and a length n, Algorithm 2 builds a new “state-sensitive” Markov model M′ which incorporates the relational constraints represented by the automaton A (see Figure 9). Each Markov label in M′ represents a label-state pair ⟨x,q⟩ where x ∈ Σ is a label of the original Markov model M and q ∈ Q is a state of the automaton A. Whereas with M we can directly sample values x_{1}, …, x_{n} for a sequence of random variables X = (X_{1}, …, X_{n}), in the “state-sensitive” model M′ we first sample values x′_{1}, …, x′_{n} for a sequence of random label-state variables X′ = (X′_{1}, …, X′_{n}). We obtain X through the assignment X = (x′_{1}.label, …, x′_{n}.label) where x′_{i}.label is the value of the label in the label-state pair x′_{i}.
The set of initial probabilities P′_{I} of M′ is defined as

P′_{I}(⟨x,q⟩) = P_{I}(x) if δ(q_{0},x) = q, and 0 otherwise,

and the transition probability function P′_{T} of M′ is defined as

P′_{T}(⟨x_{2},q_{2}⟩ | ⟨x_{1},q_{1}⟩) = P_{T}(x_{2} | x_{1}) if δ(q_{1},x_{2}) = q_{2}, and 0 otherwise.
A new set of unary “state-sensitive” constraints U′ is also created such that for each unary constraint (X_{i},ρ) ∈ U, an equivalent constraint (X′_{i}, {x′ | x′.label ∈ ρ}) is added to U′. An accepting constraint (X′_{n}, {x′ | x′.state ∈ F}) is also added to U′ to ensure that all valid Markov sequences from M′ end with a label-state pair x′ = ⟨x,q⟩ where q ∈ F.
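The definitions of P′_I, P′_T, and the accepting constraint can be sketched directly in Python. This is an illustrative reconstruction of the product construction, not the authors' code; it returns unnormalized distributions, leaving arc-consistency and normalization to the NHMM construction.

```python
from itertools import product

def state_sensitive_model(Sigma, Q, P_I, P_T, delta, q0, F):
    """Return the (unnormalized) initial and transition distributions of
    M' over label-state pairs <x, q>, plus the accepting set for the
    final constraint: probabilities are inherited from M exactly where
    the DFA transition function agrees, and are zero elsewhere."""
    pairs = list(product(Sigma, Q))
    # P_I'(<x, q>) = P_I(x) if delta(q0, x) = q, else 0
    P_I2 = {(x, q): (P_I.get(x, 0.0) if delta.get((q0, x)) == q else 0.0)
            for (x, q) in pairs}
    # P_T'(<x2, q2> | <x1, q1>) = P_T(x2 | x1) if delta(q1, x2) = q2, else 0
    P_T2 = {((x1, q1), (x2, q2)):
            (P_T.get((x1, x2), 0.0) if delta.get((q1, x2)) == q2 else 0.0)
            for (x1, q1) in pairs for (x2, q2) in pairs}
    # accepting constraint: the final label-state pair must carry q in F
    accept = {(x, q) for (x, q) in pairs if q in F}
    return P_I2, P_T2, accept

# toy DFA over {a, b} that tracks whether the last symbol was 'b'
Sigma, Q = {'a', 'b'}, {0, 1}
delta = {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 1}
P_I = {'a': 0.5, 'b': 0.5}
P_T = {(x1, x2): 0.5 for x1 in Sigma for x2 in Sigma}
P_I2, P_T2, accept = state_sensitive_model(Sigma, Q, P_I, P_T, delta, 0, {1})
assert P_I2[('b', 1)] == 0.5 and P_I2[('b', 0)] == 0.0
assert P_T2[(('a', 0), ('b', 1))] == 0.5
```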
The final step of Algorithm 2 constructs an NHMM from the unnormalized M′, U′, and n using the construction set forth by Pachet et al. (2011) which enforces arc-consistency and normalizes probabilities to create the target distribution P*.
Papadopoulos et al. (2015) demonstrate a mechanism for computing P* that uses factor graphs with belief propagation for sampling sequences with exact probabilities according to the original distribution. In their solution a backward phase computes the impact on each sequential position i in the factor graph of the sub-factor graph representing all positions “to the right” of i. A forward phase computes the marginal distribution over values for each sequence position i given the partial instantiation for variables representing positions “to the left” of i and simultaneously samples a sequence. A potential drawback to this solution is that though the backward phase is completed once, the forward phase is repeated each time a sequence is sampled.
The approach we present, shown in Algorithm 2, samples sequences under regular and Markov constraints with exact probability using an NHMM. An NHMM computes all marginal distributions ahead of time, speeding up sampling time at the cost of an increased build time (see Figure 10). This makes it a better choice for online sampling or when sampling larger sets of sequences. For a detailed explanation of NHMMs and the method of their construction (referenced as “constructNHMM()” in Algorithm 2) we refer the reader to Pachet et al. (2011).
RELATIONAL automata are effective tools for imposing both Markovian transitional constraints (local, horizontal structure) and relational constraints (global, long-range structure). This is well-demonstrated in using binary relational constraints to constrain parallel interrelated sequences, as we show in the following example of generating a musical sequence with global verse-chorus structure.
Verse-chorus structure is achieved in lyrical sectional-form music when subsequences of different types or viewpoints (e.g., chords, pitches, rhythms, and lyrics) are repeated at intervals throughout the musical sequence. For example, a verse is generally regarded as a segment where the chords, pitches, and rhythms are repeated but the lyrics differ, while a chorus is generally regarded as a segment where all four of those viewpoints are repeated.
Repetition of subsequences requires the use of a series of binary relational constraints using the equality relation. Rather than hand-craft a set of relational constraints C_{v} for each viewpoint v, we learn existing patterns of repetition from normalized data and then translate these patterns into sets of binary relational constraints. For each of four viewpoints (chords (H), rhythm (R), lyrics (L), and pitch (P)) we perform multi-Smith-Waterman (mSW) self-alignment on the discretized musical sequence of an existing composition, using the parameterizations reported in Bodily and Ventura (2021). An example of structural patterns learned from the song Twinkle, Twinkle, Little Star is shown in Figure 11. For each viewpoint v ∈ {H,R,L,P} the n × n alignment matrix (where n is the length of the song) identifies matching regions (signified by dark red diagonals) in viewpoint v of the training song. For each non-zero entry in row i, column j of the matrix we extract a binary relational constraint (X_{i},X_{j},{(x,y) | x = y}).
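Translating a self-alignment matrix into equality constraints amounts to reading off its non-zero entries. A minimal sketch (plain nested lists stand in for the mSW score matrix; the names and the threshold of zero are our assumptions):

```python
def extract_equality_constraints(alignment):
    """For each non-zero entry (i, j) above the diagonal of the n x n
    self-alignment matrix, emit a binary equality constraint between
    positions i and j, encoded here as an index pair."""
    n = len(alignment)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if alignment[i][j] > 0]

# toy symmetric matrix: position 0 aligns with 4, and 1 with 5
A = [[0] * 6 for _ in range(6)]
A[0][4] = A[4][0] = 1
A[1][5] = A[5][1] = 1
assert extract_equality_constraints(A) == [(0, 4), (1, 5)]
```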
In addition to the global, long-range structure imposed by these binary relational constraints, we added the following constraints:
Chord. Because binary relational constraints are designed to enforce structural repetition, it is not uncommon for generated sequences (particularly those from models trained on small data sets) to become overly repetitive, repeating structure even when it is not enforced. To counteract this effect in the chord sequence, we added a second set of binary relational constraints which constrained a subset of positions within regions not aligned via the mSW alignment to not match according to the equality relation. The model is constrained to generate sequences starting and ending with the same chord, namely a C major or A minor chord.
Rhythm. We add unary constraints to make the generated rhythmic sequence end with a half-note rhythm to evoke a sense of finality.
Lyric. The length (in syllables) of the lyric sequence, n_{L}, is derived from the number of sampled rhythm tokens representing non-rest rhythmic onsets. Prosody (i.e., aligning stressed syllables with stressed beats) is achieved by constraining syllables for offbeat rhythms to be unstressed. Rhyming is effectuated using a second binary relational constraint set C′_{L} = {(X_{i},X_{j},{(x,y) | x and y are rhyming syllables})}. C′_{L} is constructed such that the last syllables in measures 2 and 4 rhyme and the last syllables in measures 6 and 8 rhyme.
Pitch. We constrain the last pitch in the composition to be the root of the final chord. For all positions i we constrain the pitch at i to equal one of the pitches defining the chord at position i as per the previously sampled harmonic sequence.
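The chord-tone restriction on pitches reduces to a per-position unary constraint derived from the previously sampled harmonic sequence. A toy sketch with assumed chord spellings (pitch classes, C = 0; the spellings and names are illustrative, not from the paper):

```python
# assumed chord spellings as pitch-class sets (C = 0)
CHORD_TONES = {'C': {0, 4, 7}, 'Am': {9, 0, 4}, 'G': {7, 11, 2}}

def pitch_unary_constraints(chords):
    """Per position, the set of pitch classes allowed by the chord
    sampled at that position -- i.e., one unary constraint per position."""
    return [CHORD_TONES[c] for c in chords]

allowed = pitch_unary_constraints(['C', 'G', 'Am'])
assert 4 in allowed[0] and 11 in allowed[1] and 9 in allowed[2]
```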
We tried different Markov orders for the models used to generate each of the viewpoints. We found that a Markov order of 4–5 worked well for eighth-note chord sequences, 3–4 for eighth-note pitch and rhythm sequences, and 4–5 for the syllable-level lyric model.
The Markov models M_{H} and M_{P} for the chord and pitch sequences were trained on the chord and pitch sequence data in Somewhere Over the Rainbow. The rhythmic Markov model M_{R} was trained on rhythms from 5 different songs to give the model a rich choice of rhythmic variation. The lyric Markov model M_{L} was trained on the lyrics from the Beatles’ Hey Jude and John Lennon’s Imagine.
Given the relational constraints C_{v}, the Markov model M_{v}, and these additional binary and unary constraints, we generate a RELATIONAL automaton A_{v} and a REGULAR NHMM for each viewpoint v ∈ {H,R,L,P}. The chord, rhythm, lyrics, and pitch sequences shown in Figure 12 were sampled from their respective NHMMs with probabilities 4.9e-5, 7.7e-6, 0.032, and 2.0e-3, respectively. Source code for building RELATIONAL automata and REGULAR NHMMs is available at https://github.com/norkish/pop-star.
To demonstrate the efficacy of the model, we generated 12 compositions at random and analyzed their long-range dependencies. Of these 12 compositions, 7 satisfy a set of 147 constraints extracted from Twinkle, Twinkle, Little Star, and 5 satisfy a set of 273 constraints extracted from Somewhere Over the Rainbow. Constraints were automatically extracted at runtime using the mSW self-alignment method described by Bodily and Ventura (2021). An example of a set of constraints extracted via this method is represented by the coloring in Figure 12. A breakdown of the types and average lengths of the long-range dependency constraints in each set appears in Table 2.
Table 2. Breakdown of the types and average lengths of the long-range dependency constraints in each constraint set.

| | | Constraint Set 1 (Twinkle, Twinkle, Little Star) | Constraint Set 2 (Somewhere Over the Rainbow) |
|---|---|---|---|
| Form | | Ternary (ABA) | 32-bar form (AABA) |
| Measures | | 12 | 32 |
| Events per measure | | 8 | 8 |
| Composition length | | 96 | 256 |
| Total constraint count | Harmony | 48 | 64 |
| | Rhythm | 16 | 96 |
| | Lyrics | 32 | 26 |
| | Pitch | 48 | 80 |
| | Rhyme | 3 | 7 |
| Average constraint range (in eighth notes) | Harmony | 48 | 184 |
| | Rhythm | 80 | 138 |
| | Lyrics | 64 | 192 |
| | Pitch | 48 | 160 |
| | Rhyme | 16 | 27 |
Despite being generated from only 2 different sets of constraints, the 12 compositions are otherwise significantly varied. As shown in Figure 13, the majority of k-tuples (i.e., event subsequences) in any viewpoint are unique, even for small values of k, not to mention the added uniqueness that arises from the combination of viewpoints. The same variation is seen in how each of the 12 compositions satisfies its respective rhyming constraints. Of the 34 rhyming constraints collectively satisfied by the 12 compositions, over a dozen different rhyme groups are represented and only one rhyme is repeated (see Figure 14). The 12 compositions are available online at https://www2.cose.isu.edu/~bodipaul/research/pop_star/.
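The k-tuple uniqueness measure can be sketched as follows; this is our reconstruction of the statistic behind Figure 13 (the paper may aggregate across compositions differently), computing the fraction of length-k subsequences of a sequence that occur exactly once within it:

```python
from collections import Counter

def unique_ktuple_fraction(seq, k):
    """Fraction of the length-k subsequences (k-tuples) of seq that
    occur exactly once in seq."""
    grams = [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    counts = Counter(grams)
    return sum(1 for g in grams if counts[g] == 1) / len(grams)

# ('C','G') appears twice; the other two bigrams are unique
assert unique_ktuple_fraction(['C', 'G', 'C', 'G', 'A'], 2) == 0.5
```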
Although for simplification we have focused exclusively on binary relations, the algorithms and logic in this approach could be adapted to allow for n-ary relations ρ and relational constraints (X_{i_1}, …, X_{i_n}, ρ). Such adaptations are unlikely to negatively affect time- or space-complexity (except to the extent that they span longer distances) because they potentially constrain the model more heavily, thus reducing the search space significantly. These impacts and possible applications of n-ary relational constraints in Markov processes are a possible target of future research.
Of particular interest is the runtime required to construct a REGULAR NHMM. Because the RELATIONAL automaton built by Algorithm 1 essentially represents an exhaustive search tree, construction incurs an exponential worst-case runtime. While such a complexity might normally preclude the viability of the solution, we maintain that the algorithm is useful for two reasons. First, in practice the RELATIONAL automaton is rarely a full tree. The tree only expands in regions encompassed by long-range dependencies; outside of these regions, the tree collapses into a single chain. Pruning by constraints also significantly reduces the breadth of the tree throughout the search process. The fact that more constraints may be better from a runtime perspective is actually encouraging: much future research into enhancing semantic meaning in music and lyrics is likely to evolve through constraints. Second, even for average-length pop music compositions, our experience suggests that the processing speeds of most modern computers are sufficient to generate solutions in reasonable time frames despite the increased complexity.
The solution we have presented enables probabilistic sampling of sequences that exhibit complex relational structure while maintaining natural, fully-Markovian transitions and allowing, via low Markov orders, a broad range of expression. The ability to impose long-range dependencies using relational and regular constraints represents a critical achievement for AI composition, enabling researchers to focus on how to effectively use such tools to model music at the level of musically meaningful events and patterns.
^{1}Source code and data available at https://github.com/norkish/pop-star.
^{2}i.e., values for variables—even values which do not violate constraints—are pruned if such values cannot possibly be part of a solution consistent with constraints.
^{3}See https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html for a list of POS tag definitions.
The authors have no competing interests to declare.
Barbieri, G., Pachet, F., Roy, P., and Esposti, M. D. (2012). Markov constraints for generating lyrics with style. In Proceedings of the European Conference on Artificial Intelligence, pages 115–120. IOS Press.
Bodily, P. M. and Ventura, D. (2021). Inferring structural constraints in musical sequences via multiple self-alignment. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 43.
Briot, J.-P. and Pachet, F. (2020). Deep learning for music generation: Challenges and directions. Neural Computing and Applications, 32(4):981–993. DOI: https://doi.org/10.1007/978-3-319-70163-9
Collins, T. and Laney, R. (2017). Computer–generated stylistic compositions with long–term repetitive and phrasal structure. Journal of Creative Music Systems, 1(2). DOI: https://doi.org/10.5920/JCMS.2017.02
Colton, S., Gow, J., Torres, P., and Cairns, P. A. (2010). Experiments in objet trouvé browsing. In Proceedings of the International Conference on Computational Creativity, pages 238–247.
Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J., and Liu, R. (2019). Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.
Davies, M. (2009). The 385+ million word corpus of contemporary American English (1990–2008+): Design, architecture, and linguistic insights. International Journal of Corpus Linguistics, 14(2):159–190. DOI: https://doi.org/10.1075/ijcl.14.2.02dav
Hadjeres, G. and Nielsen, F. (2020). AnticipationRNN: Enforcing unary constraints in sequence generation, with application to interactive music generation. Neural Computing and Applications, 32(4):995–1005. DOI: https://doi.org/10.1007/s00521-018-3868-4
Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Shazeer, N., Hawthorne, C., Dai, A., Hoffman, M., and Eck, D. (2018). Music Transformer: Generating music with long-term structure. arXiv preprint arXiv:1809.04281.
Louie, R., Cohen, A., Huang, C.-Z. A., Terry, M., and Cai, C. J. (2020). Cococo: AI-steering tools for music novices co-creating with generative models. In 25th International Conference on Intelligent User Interfaces, Workshop on Human-AI Co-Creation with Generative Models.
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of the Association for Computational Linguistics (System Demonstrations), pages 55–60. DOI: https://doi.org/10.3115/v1/P14-5010
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.
Nunes, J. C., Ordanini, A., and Valsesia, F. (2015). The power of repetition: Repetitive lyrics in a song increase processing fluency and drive market success. Journal of Consumer Psychology, 25(2):187–199. DOI: https://doi.org/10.1016/j.jcps.2014.12.004
Oore, S., Simon, I., Dieleman, S., Eck, D., and Simonyan, K. (2020). This time with feeling: Learning expressive musical performance. Neural Computing and Applications, 32(4):955–967. DOI: https://doi.org/10.1007/s00521-018-3758-9
Pachet, F., Paris, S. C., Papadopoulos, A., and Roy, P. (2017). Sampling variations of sequences for structured music generation. In Proceedings of the International Society for Music Information Retrieval Conference, pages 167–173. DOI: https://doi.org/10.4324/9781315621364-19
Pachet, F., Roy, P., Barbieri, G., and Paris, S. C. (2011). Finite-length Markov processes with constraints. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 635–642.
Papadopoulos, A., Pachet, F., Roy, P., and Sakellariou, J. (2015). Exact sampling for regular and Markov constraints with belief propagation. In Proceedings of the International Conference on Principles and Practice of Constraint Programming, pages 341–350. Springer. DOI: https://doi.org/10.1007/978-3-319-23219-5_24
Papadopoulos, A., Roy, P., and Pachet, F. (2014). Avoiding plagiarism in Markov sequence generation. In Proceedings of the Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence, pages 2731–2737.
Perez, G. and Régin, J.-C. (2017). MDDs: Sampling and probability constraints. In International Conference on Principles and Practice of Constraint Programming, pages 226–242. Springer. DOI: https://doi.org/10.1007/978-3-319-66158-2_15
Schulze, W. and Van Der Merwe, B. (2011). Music generation with Markov models. IEEE MultiMedia, 18(3):78–85. DOI: https://doi.org/10.1109/MMUL.2010.44
Walker, W., Lamere, P., Kwok, P., Raj, B., Singh, R., Gouvea, E., Wolf, P., and Woelfel, J. (2004). Sphinx-4: A flexible open source framework for speech recognition. Technical report, Sun Microsystems, Inc.
Widmer, G. (2016). Getting closer to the essence of music: The Con Espressione manifesto. ACM Transactions on Intelligent Systems and Technology (TIST), 8(2):1–13. DOI: https://doi.org/10.1145/2899004