Artist similarity plays an important role in organizing, understanding, and, subsequently, facilitating discovery in large collections of music. In this paper, we present a hybrid approach to computing similarity between artists using graph neural networks trained with triplet loss. The novelty of using a graph neural network architecture is that it combines the topology of a graph of artist connections with content features to embed artists into a vector space that encodes similarity. Additionally, we propose a simple and effective regularization method, Connection Dropout, which improves results for artists in the long tail.

To evaluate the proposed method, we use two datasets: the open OLGA dataset, which contains artist similarities from AllMusic, together with content features from AcousticBrainz, and a larger, proprietary dataset. We find that using graph neural networks yields superior overall results compared to state-of-the-art methods.

Beyond the overall evaluation, we investigate the effectiveness of the proposed model for long-tail artists. Such artists may benefit less from graph-based methods, since they typically have few known connections. We show that the proposed regularization approach clearly improves the performance for long-tail artists, without negatively affecting results for well-connected ones; it computes high-quality embeddings and good similarity scores for all artists.

Music similarity sparked interest early in the Music Information Retrieval community (

There is, however, no consensual notion of

In this paper—which is an extended version of our previous work (

In this sense, we aim at bridging the semantic gap (

A variety of methods have been devised for computing artist similarity, from the use of audio descriptors to measure similarity (

Other approaches use deep neural networks to learn artist embeddings from heterogeneous data sources and then compute similarity in the resulting embedding space (

Most recently, graph neural networks (GNNs) successfully improved upon metric-learning-based approaches: Salha-Galvan et al. (

Our artist similarity model thus combines graph approaches and embedding approaches using GNNs. The proposed model, described in detail in Section 3, uses content-based features (audio descriptors, or musicological attributes) together with explicit similarity relations between artists made by human experts (or extracted from listener feedback). These relations are represented in a graph of artists; the topology of this graph thus reflects the contextual aspects of artist similarity. The proposed graph neural network is trained using triplet loss to learn a function that embeds artists using both

We use two datasets (described in-depth in Section 4) to evaluate our approach: the OLGA dataset, which is collected from publicly available sources, comprising 17,673 artists; and a larger, proprietary dataset, consisting of 136,731 artists. Our experiment setup—metrics, models, data partitioning, etc.—is detailed in Section 5.

Beyond overall results, we take a deeper look at the model’s performance on long-tail artists. Both are presented in Section 6. In contrast to Salha-Galvan et al. (

The goal of an artist similarity model is to define a function

Many content-based methods for similarity estimation have been developed in the last decades of MIR research (see Section 2). The field has closely followed the state-of-the-art in machine learning research, with general improvements coming from the latter translating well into improvements in the former. Acknowledging this fact, we select our baselines based on the most recent developments: Siamese neural networks trained with variants of the triplet loss (

The fundamental idea of metric learning is to learn a projection $f(\mathbf{x}_v)$ that maps the features $\mathbf{x}_v$ of an item $v$ into a vector space in which distances reflect similarity.

There is an abundance of methods that embed items into a vector space, many rooted in statistics, that have been applied to music similarity. The triplet loss connects such embedding methods with metric learning: given triplets of an anchor $a$, a similar (positive) example $p$, and a dissimilar (negative) example $n$, it is defined as

$$\mathcal{L} = \left[ d\left(f(\mathbf{x}_a), f(\mathbf{x}_p)\right) - d\left(f(\mathbf{x}_a), f(\mathbf{x}_n)\right) + \Delta \right]^{+},$$

where $\mathbf{x}_a$, $\mathbf{x}_p$ and $\mathbf{x}_n$ are the features of the anchor, positive and negative examples, $d(\cdot, \cdot)$ is a distance function, $\Delta$ is a margin, and $[\cdot]^{+}$ is the ramp function.
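As an illustration, the triplet loss with a Euclidean distance can be sketched in a few lines; the margin value here is an arbitrary choice for the example, not a value from the paper:

```python
import numpy as np

def triplet_loss(x_a, x_p, x_n, margin=0.2):
    """Triplet loss with Euclidean distance; [.]^+ is the ramp function.
    The margin of 0.2 is an illustrative choice."""
    d_ap = np.linalg.norm(x_a - x_p)  # anchor-positive distance
    d_an = np.linalg.norm(x_a - x_n)  # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)
```

Minimizing this loss pulls positive examples towards the anchor and pushes negative examples away, until negatives are farther than positives by at least the margin.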

As mentioned before, state-of-the-art music similarity models are almost exclusively based on learning deep neural networks using the triplet loss. We thus adopt this method as our baseline model, which will serve as a comparison point to the graph neural network we propose in the following sections.

A set of artists and their known similarity relations can be seen as a graph, where the artists represent the nodes, and the similarity relations their (undirected) connections. Graph methods thus naturally lend themselves to model the artist similarity problem (

The GNN we use in this paper comprises two parts: first, a block of graph convolution layers, which aggregate information from an artist's neighborhood in the graph; and second, a block of fully connected layers, which compute the final embedding.

Overview of the graph neural network we use in this paper. First, the input features $\mathbf{x}_v$ of each artist $v$ are processed by a stack of graph convolution layers; fully connected layers then compute the output embedding $\mathbf{y}_v$.

We train the model using the triplet loss, in an identical setup as the baseline model. Viewing the proposed GNN from this angle, the only difference between the GNN and a standard embedding network is the additional block of graph convolution layers.

The graph convolution algorithm, as defined by Hamilton et al. (

As a neighborhood function, most models use guided or uniform sub-sampling of the graph structure (

In this work, we take a simple approach, and use point-wise weighted averaging to aggregate neighbor representations, and select the strongest 25 connections as neighbors. If weights are not available, we use the simple average of random (but fixed) 25 connections. This enables us to use a single sparse dot-product with an adjacency matrix to select and aggregate neighborhood embeddings. Note that this is not the full adjacency matrix of the complete graph, as we select only the parts of the graph which are necessary for computing embeddings for the nodes in a mini-batch.
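The aggregation described above—point-wise weighted averaging via a single sparse dot-product with a (partial) adjacency matrix—can be sketched as follows; the graph, weights, and feature values are hypothetical:

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical mini-graph: node 0 has two weighted neighbors (1 and 2);
# the connection weights are row-normalized so the product is a
# weighted average of the neighbors' representations.
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [2.0, 2.0]])
rows = [0, 0]              # target node
cols = [1, 2]              # its selected neighbors
weights = [0.75, 0.25]     # normalized connection weights
adj = sp.csr_matrix((weights, (rows, cols)), shape=(3, 3))

# One sparse dot-product selects and aggregates neighbor features.
aggregated = adj @ features
```

In practice, the matrix contains only the rows needed for the nodes in a mini-batch, not the full adjacency matrix of the complete graph.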

Algorithm 1 describes the inner workings of the graph convolution block of our model. Here, the $v^{\text{th}}$ row of the input matrix $\mathbf{X}$ holds the features of node $v$.

To compute the output of a graph convolution layer for a node, we need to know its neighbors. Therefore, to compute the embeddings for a mini-batch of nodes, we first trace the graph to collect all nodes whose features are required as input.

Tracing the graph to find the necessary input nodes for embedding the target node (orange). Each graph convolution layer requires tracing one step in the graph. Here, we show the trace for a stack of two such layers. To compute the embedding of the target node in the last layer, we need the representations from the previous layer of itself and its neighbors (green). In turn, to compute these representations, we need to expand the neighborhood by one additional step in the preceding GC layer (blue). Thus, the features of all colored nodes must be fed to the first graph convolution layer.
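The tracing procedure described above can be sketched as a simple iterative neighborhood expansion, one step per graph convolution layer (function and variable names are our own):

```python
def trace_nodes(batch, neighbors, num_gc_layers):
    """Collect all nodes whose input features are needed to embed the
    nodes in `batch` with a stack of `num_gc_layers` graph convolution
    layers. `neighbors` maps a node to its selected neighbors.
    A sketch of the tracing step, not the paper's implementation."""
    needed = set(batch)
    for _ in range(num_gc_layers):  # one expansion per GC layer
        needed |= {n for v in needed for n in neighbors.get(v, ())}
    return needed
```

For a chain graph 0→1→2, embedding node 0 with two GC layers requires the features of all three nodes, matching the two-step trace in the figure.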

At the core of each graph convolution layer is the aggregation of neighborhood representations into a matrix that holds the $\ell_2$-normalized representations of each node in the mini-batch in its columns. It is fed into the following fully connected layers, which then compute the output embedding $\mathbf{y}_v$ of each artist $v$.

As we observed in our experiments (see Section 5), the GNNs learned to rely overly on the graph topology. This is because, given enough GC layers, graph topology trumps features when it comes to predicting similarity (as we will see in Section 5). To alleviate this issue, we introduce a tweak during training: each time we consult the neighborhood of a node, we randomly discard each of its connections with a fixed probability. We call this technique Connection Dropout.

Connection Dropout can be seen as sub-sampling the neighborhoods in the graph. Sub-sampling has been previously used in GNNs, but for a different purpose: to condense neighborhoods and to control the computational burden. Indeed, Ying et al. (
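Our reading of Connection Dropout can be sketched as follows; this is a training-time sketch under the assumption that each connection is dropped independently:

```python
import random

def connection_dropout(neighbors, p, rng=random):
    """Discard each of a node's connections independently with
    probability `p` during training (a sketch of Connection Dropout;
    names and signature are our own)."""
    return [n for n in neighbors if rng.random() >= p]
```

With `p=0.0` the full neighborhood is kept (the standard GNN); higher values force the network to compute useful embeddings even from sparse neighborhoods, mimicking the long-tail condition.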

Many published studies on the topic of artist similarity are limited by data: datasets including artists, their similarity relations, and their features comprise at most hundreds to a few thousand artists. In addition, the provided ground truth is often based on third-party APIs with unknown similarity methods, such as the last.fm API, rather than on data curated by human experts.

For instance, Oramas et al. (

Due to all these issues regarding existing datasets, we compiled a new dataset, the OLGA Dataset, which we describe in the following.

For the OLGA (“

1. Select a common pool of artists based on the unique artists in the Million Song Dataset (

2. Map the available MusicBrainz IDs of the artists to AllMusic IDs using the mapping available from MusicBrainz.

3. For each artist, obtain the list of "related" artists from AllMusic; this data can be licensed and accessed on their website. Use only related artists who can be mapped back to MusicBrainz.

4. Using the MusicBrainz API, select up to 25 tracks for each artist, and collect the low-level features of those tracks from AcousticBrainz.

5. Compute the track feature centroid of each artist.

In total, the dataset comprises 17,673 artists connected by 101,029 similarity relations. On average, each artist is connected to 11.43 other artists. The quartiles are at 3, 7, and 16 connections per artist. The lower 10% of artists have only one connection, the top 10% have at least 27.

While the dataset is still small compared to industrial catalog sizes, it is significantly bigger than other datasets available for this task. Its size and available features permit applying more data-driven machine learning methods to the problem of artist similarity.

For our experiments, we partition the artists following an 80/10/10 split into 14,139 training, 1,767 validation, and 1,767 test artists.

We also use a larger proprietary dataset to demonstrate the scalability of our approach. Here, explicit feedback from listeners of a music streaming service is used to define whether two artists are similar or not: we derive similarity connections based on the co-occurrence of positive feedback for two artists.

For artist features, we use the centroid of an artist’s track features. These track features are

In total, this dataset consists of 136,731 artists connected by 3,277,677 similarity relations. The number of connections per artist is a top-heavy distribution with a few artists sharing most of the connections: the top 10% are each connected to more than 134 others, while the bottom 10% to only one. The quartiles are at 2, 5, and 48 connections per artist.

We follow the same partition strategy as for the OLGA dataset, which results in 109,383 training, 13,674 validation, and 13,674 test artists.

Our experiments aim to evaluate how well the embeddings produced by our model capture artist similarity. To this end, we set up a ranking scenario: given a query artist, we retrieve its $K$ nearest neighbors in the embedding space, and score this ranked list using the normalized discounted cumulative gain:

$$\text{NDCG@}K = \frac{\text{DCG@}K}{\text{IDCG@}K}, \qquad \text{DCG@}K = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i + 1)},$$

where $rel_i$ is 1 if the artist at rank $i$ is similar to the query according to the ground truth, and 0 otherwise; $\text{IDCG@}K$ is the $\text{DCG@}K$ of an ideal ranking, in which all relevant artists are ranked first.
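The NDCG@K metric with binary relevance can be computed as follows (a standard-definition sketch, with our own function names):

```python
import math

def ndcg_at_k(ranked, relevant, k):
    """NDCG@K with binary relevance. `ranked` is the retrieved list,
    `relevant` the set of ground-truth similar artists."""
    # DCG: discounted gain of relevant items at their actual ranks
    dcg = sum(1.0 / math.log2(i + 2)
              for i, artist in enumerate(ranked[:k]) if artist in relevant)
    # IDCG: gain of the ideal ranking (all relevant items first)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0
```

A score of 1.0 means all known similar artists are ranked first; ranking them lower discounts their contribution logarithmically.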

In the following, we first explain the models, their training details, the features, and the evaluation data used in our experiments. Then, we show, compare and analyze the results.

As explained in Section 3.2.1, a GNN with no graph convolution layers is identical to our baseline model (i.e. a DNN trained using triplet loss). This allows us to fix hyper-parameters between the baseline and the proposed GNN, and isolate the effect of adding graph convolutions to the model. For each dataset, we thus train and evaluate four models with 0 to 3 graph convolution layers.

The other hyper-parameters remain fixed: each layer in the graph convolutional front-end consists of 256 exponential linear units (ELUs), and the model outputs $\ell_2$-normalized embeddings.

We are able to train the largest model with 3 graph convolution layers within 2 hours on the proprietary dataset, and under 5 minutes on OLGA, using a Tesla P100 GPU and 8 CPU threads for data loading, which includes tracing the graph to find the relevant neighborhood as explained in Section 3.2.2.

We build artist-level features by averaging track-level features of the artist’s tracks. Depending on the dataset, we have different types of features at hand.
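The artist-level feature construction is a simple centroid over track-level vectors; a minimal sketch:

```python
import numpy as np

def artist_centroid(track_features):
    """Artist-level feature vector = centroid (mean) of the artist's
    track-level feature vectors (a sketch of the described averaging)."""
    return np.stack(track_features).mean(axis=0)
```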

In the OLGA dataset, we have low-level audio features extracted by the AcousticBrainz project using the Essentia library. These features represent track-level statistics about the loudness, dynamics and spectral shape of the signal, but they also include more abstract descriptors of rhythm and tonal information, such as BPM and the average pitch class profile.

Although AcousticBrainz also provides high-level features such as mood and genre predictions, we refrain from using them. The reason is twofold: first, they are derived from the low-level features themselves, and as such, do not provide complementary information; second, as stated on the AcousticBrainz website itself, the high-level features may be subject to change if and when the models predicting them are changed, re-trained or improved.

We select all numeric features and pre-process them as follows: we apply element-wise standardization, discard features with missing values, and flatten all numbers into a single vector of 2613 elements.
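The pre-processing described above can be sketched as follows; this sketch drops columns with missing values before standardizing (equivalent for the surviving columns), and assumes every remaining column has non-zero variance:

```python
import numpy as np

def preprocess(features):
    """Discard feature columns with missing values, then standardize
    element-wise. `features` is an (artists x features) matrix; each
    row is one artist's flattened feature vector."""
    valid = ~np.isnan(features).any(axis=0)   # columns without NaNs
    x = features[:, valid]
    return (x - x.mean(axis=0)) / x.std(axis=0)
```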

In the proprietary dataset, we use numeric musicological descriptors annotated by experts (for example, “the nasality of the singing voice”). We apply the same pre-processing for these, resulting in a total of 170 values.

Using two different types of content features gives us the opportunity to evaluate the utility of our graph model under different circumstances, or more precisely, features of different quality and signal-to-noise ratio. The low-level audio-based features available in the OLGA dataset are undoubtedly noisier and less specific than the high-level musical descriptors manually annotated by experts, which are available in the proprietary dataset. Experimenting with both permits us to gauge the effect of using the graph topology for different data representations.

In addition, we also train models with random vectors as input features; such models can only leverage the graph topology to find similar artists.

As described in Section 4, we partition artists into a training, validation and test set. When evaluating on the validation or test sets, we only consider artists from these sets as candidates and potential true positives. Specifically, let $A_{\text{eval}}$ be the set of evaluation artists; we compute embeddings only for these artists, retrieve nearest neighbors only from this set, and consider only ground-truth similarity connections within $A_{\text{eval}}$.

This notion is more nuanced in the case of GNNs. Here, we want to exploit the known connections within $A_{\text{train}}$ (the training set) also at evaluation time: we extend the training graph with the evaluation artists, but only add connections between $A_{\text{train}}$ and $A_{\text{eval}}$. This process is outlined in the following figure.

Artist nodes and their connections used for training (green) and evaluation (orange). During training, only green nodes and connections are used. When evaluating, we extend the graph with the orange nodes, but only add connections between validation and training artists. Connections among evaluation artists (dotted orange) remain hidden. We then compute the embeddings of all evaluation artists, and evaluate based on the hidden evaluation connections.

Note that this does not leak information between train and evaluation sets; the features of evaluation artists have not been seen during training, and connections within the evaluation set—these are the ones we want to predict—remain hidden.
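The edge-visibility rule of this evaluation protocol can be sketched directly (function name and edge representation are our own):

```python
def split_connections(edges, eval_artists):
    """Edges touching at least one training artist stay visible to the
    GNN; edges between two evaluation artists remain hidden and serve
    as the prediction targets (a sketch of the protocol above)."""
    visible = [(u, v) for u, v in edges
               if not (u in eval_artists and v in eval_artists)]
    hidden = [(u, v) for u, v in edges
              if u in eval_artists and v in eval_artists]
    return visible, hidden
```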

Overall evaluations portray a model's performance from a bird's-eye view. Beyond that, we are interested in the performance of our model for the segment of long-tail artists. Such artists usually have few known connections, which not only limits the information a GNN is able to leverage, but also limits our capability to evaluate how well the GNN leverages the information that does exist. Since ground truth for these artists is sparse, retrieved lists of similar artists can contain relevant items that are not known to be relevant; we cannot quantitatively distinguish a list of bad recommendations from a list of good recommendations that we do not know to be good.

To circumvent this problem, we collect a subset of well-connected artists, for which we then artificially sparsify the known connections.

Depending on the dataset, we use different criteria to select these artists. Since each dataset differs in size and connection density, parameters that work for one would not work for the other. For the proprietary dataset, which is large and densely connected, we use artists with at least 25 connections to the training graph (known connections) and at least 50 unseen evaluation connections; for the smaller and more sparsely connected OLGA dataset, we require at least 25 known connections and at least 5 unseen ones.

We will first discuss the overall results in the following section. Then, we will use the subsets of artists selected in Section 5.4 to evaluate the sensitivity of our model to decaying connectivity, as observed with less popular artists.

NDCG@200 for the baseline (DNN) and the proposed model with 3 graph convolution layers (GNN), using features or random vectors as input. The GNN with real features as input gives the best results. Most strikingly, the GNN with random features—using only the known graph topology—out-performs the baseline DNN with informative features.

DATASET | FEATURES | DNN | GNN
---|---|---|---
OLGA | Random | 0.02 | 0.45
OLGA | AcousticBrainz | 0.24 | 0.55
Proprietary | Random | 0.00 | 0.52
Proprietary | Musicological | 0.44 | 0.57

Additionally, the results indicate, perhaps unsurprisingly, that the low-level audio features in the OLGA dataset are less informative than the manually annotated high-level features in the proprietary dataset. Although the proprietary dataset poses a more difficult challenge due to the much larger number of candidates (14k vs. 1.8k), the DNN, which can only use the features, improves more over the random baseline on the proprietary dataset (+0.44) than on OLGA (+0.22). These are only indications; for a definitive analysis, we would need to use the exact same features in both datasets.

Similarly, we could argue that the topology in the proprietary dataset seems more coherent than in the OLGA dataset. We can judge this by observing the performance gain obtained by a GNN with random features—which can only leverage the graph topology to find similar artists—compared to a completely random baseline (random features without GC layers). In the proprietary dataset, this performance gain is +0.52, while in the OLGA dataset, only +0.43. Again, while this is not a definitive analysis (other factors may play a role), it indicates that the large amounts of user feedback used to generate ground truth in the proprietary dataset give stable and high-quality similarity connections.

Results on the OLGA (top) and the proprietary (bottom) dataset with different numbers of graph convolution layers, using either the given features (left) or random vectors as features (right). Error bars indicate 95% confidence intervals computed using bootstrapping.

Looking at the scores obtained using random features (where the model depends solely on exploiting the graph topology), we observe two remarkable results. First, whereas one graph convolution layer suffices to out-perform the feature-based baseline in the OLGA dataset (0.28 vs. 0.24), using only one GC layer does not produce meaningful results (0.05) in the proprietary dataset. We believe this is due to the different sizes of the respective test sets: 14k in the proprietary dataset, while only 1.8k in OLGA. Using only a very local context seems to be enough to meaningfully organize the artists in a smaller dataset.

Second, most performance gains are obtained with two GC layers, while adding a third GC layer improves the results to a much lesser degree. Our explanation for this effect is that most similar artists are connected through at least one other, common artist. In other words, most artists form similarity cliques with at least two other artists. Within these cliques, in which every artist is connected to all others, missing connections are easily retrieved with no more than two graph convolutions.

In fact, in the OLGA dataset, ~71% of all cliques fulfill this requirement. This means that, for any hidden similarity link in the data, in 71% of cases, the true similar artist is within 2 steps in the graph—which corresponds to using two GC layers.

Let us now focus on the results specific to long-tail artists. As explained in Section 5.4, we do not use actual long-tail artists for this, since data sparsity prevents a solid evaluation. Instead, we emulate the long-tail condition by removing known connections of well-connected artists, while keeping all their unseen evaluation connections. From the OLGA dataset, we collected 44 artists with at least 25 known connections and at least 5 unseen ones; from the proprietary dataset, which is larger and more densely connected, we found 207 artists with at least 25 known connections and at least 50 unseen ones.

We train the largest models with 3 graph convolution layers using varying connection dropout probabilities: 0.0, 0.25, 0.5, 0.75, 0.95 and 0.99; a connection dropout probability of 0.0 corresponds to the baseline GNN model without connection dropout. Once these models are trained, we use them to evaluate the resulting artist embeddings in different connectivity settings: we sweep the number of known connections available to the model at embedding time, from the full 25 down to none.

Evaluation of the long-tail performance of a 3-GC-layer model on the OLGA dataset (top) and the proprietary dataset (bottom). The different bars represent models trained with different probabilities of connection dropout. The gray line in the background represents the baseline model with no graph convolution layers, with the shaded area indicating the 95% confidence interval. We see that for the standard model (blue, no connection dropout), performance degrades with fewer connections. Introducing connection dropout significantly reduces this effect.

We also observe that connection dropout greatly reduces this degradation.

Using connection dropout achieves better results for sparsely connected artists because it prevents the GNN from relying too much on the graph connectivity when computing the embedding. To substantiate this claim, we examine the stability of artist embeddings while manually removing known connections, using the same subset of artists as before. We consider the embedding computed using all 25 known connections to be the true embedding of an artist. We then remove known connections one by one, compute a new artist embedding at each level of connectivity, and calculate the cosine distance of these embeddings to the true embedding.
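The stability measurement described above relies on the cosine distance between an artist's reduced-connectivity embedding and its "true" embedding; a minimal sketch:

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two embedding vectors: 0 for identical
    directions, up to 2 for opposite ones."""
    return 1.0 - float(np.dot(a, b) /
                       (np.linalg.norm(a) * np.linalg.norm(b)))
```

At each connectivity level, this distance to the full-connectivity embedding quantifies how strongly the embedding drifts as connections are removed.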

The results are shown in the following figure.

Cosine distance between embeddings computed using reduced connectivity and the “true” embedding (computed using all 25 known connections). Without connection dropout, the GNNs learn to rely too much on the graph connectivity to compute the artist embedding: the distance between an embedding computed using fewer connections and the “true” embedding grows quickly. With connection dropout, we can strongly curb this effect.

In this paper, we described a hybrid approach to computing artist similarity, which uses graph neural networks to combine content-based features with explicit relations between artists.

To evaluate our approach, we assembled a dataset with 17,673 artists, their features, and their similarity relations. Additionally, we used a much larger proprietary dataset to show the scalability of our method. The results showed that leveraging known connections between artists can be more effective for understanding their similarity than high-quality features, and that combining both gives the best results.

The introduction of Connection Dropout, a simple and effective regularization method, additionally improved the results for long-tail artists without negatively affecting those for well-connected ones.

Our work is a first step towards models that directly use known relations between musical entities, like tracks, albums, artists, or even genres. Future work could investigate how to employ multi-modality in this context; for example, we could build a multi-modal graph by using connections between different types of entities (e.g. tracks, albums, artists), or different types of connections between the same entities (e.g. artist collaborations, band memberships). Another avenue of research could focus on collecting and using better and/or higher-level features for the OLGA dataset. This would provide a better judgement of the importance of feature quality in the proposed model.

The procedure to assemble the dataset, including relevant metadata, is available on

The exact list of low-level features we use is available at

The authors have no competing interests to declare.