
# On End-to-End White-Box Adversarial Attacks in Music Information Retrieval

## Abstract

Small adversarial perturbations of input data can drastically change the performance of machine learning systems, thereby challenging their validity. We compare several adversarial attacks targeting an instrument classifier, where for the first time in Music Information Retrieval (MIR) the perturbations are computed directly on the waveform. The attacks can reduce the accuracy of the classifier significantly, while at the same time keeping perturbations almost imperceptible. Furthermore, we show the potential of adversarial attacks being a security issue in MIR by artificially boosting playcounts through an attack on a real-world music recommender system.
How to Cite: Prinz, K., Flexer, A. and Widmer, G., 2021. On End-to-End White-Box Adversarial Attacks in Music Information Retrieval. Transactions of the International Society for Music Information Retrieval, 4(1), pp.93–104. DOI: http://doi.org/10.5334/tismir.85
Published on 07 Jul 2021
Accepted on 28 May 2021            Submitted on 11 Jan 2021

## 1 Introduction

Adversarial examples were first reported in the field of image classification (Szegedy et al., 2014), where marginal perturbations of input data could significantly degrade the performance of a machine learning system. From there on, the phenomenon was observed in various other fields, including natural language and speech processing (Carlini and Wagner, 2018; Zhang et al., 2020), as well as MIR (Sturm, 2014). Nevertheless, literature concerning adversarial vulnerability in MIR remains sparse and its relevance questionable, as it is not considered to pose a security issue, as it is in other fields.

As a second contribution, we motivate the potential seriousness of adversarial attacks in MIR by targeting a real-world content-based music recommendation system (Gasser and Flexer, 2009; Flexer and Stevens, 2018) with one of the previously developed attacks. The attack significantly increases how likely it is for rarely-recommended songs to become recommended, exploiting the so-called hubness phenomenon (Pachet and Aucouturier, 2004) and challenging the integrity of the system.

## 2 Related Work and New Contributions

Since the discovery of the vulnerability of machine learning systems to adversarial examples (Szegedy et al., 2014), numerous attacks in different domains have been proposed. These attacks are often distinguished by factors such as the amount of knowledge about an attacked system that is used (Akhtar and Mian, 2018; Zhang et al., 2020), e.g., white- or black-box attacks; the number of computational steps a method requires, e.g., one-step in contrast to iterative attacks; or whether the attack aims to change a prediction to a particular target or arbitrarily, i.e., targeted as opposed to untargeted attacks.

The focus of this work is systems built on musical data, which we attack using a white-box adversarial approach with full knowledge of the attacked systems. Despite most research having been conducted on image data, audio systems have also been shown to be vulnerable, e.g., by Carlini and Wagner (2018); Du et al. (2020). Different approaches in the audio domain adapt attacks from the image domain (Subramanian et al., 2020) or use “psychoacoustic hiding” in an attempt to make adversarial perturbations as imperceptible as possible (Schönherr et al., 2019; Qin et al., 2019).

As first adversarial attacks in MIR, filtering transformations and tempo changes have been used in untargeted black-box attacks to deflate and inflate the performance of genre, emotion and rhythm classification systems to no better than chance level or perfect 100% (Sturm, 2014, 2016). Also a targeted white-box attack on genre recognition systems has been proposed (Kereliuk et al., 2015), in which magnitude spectral frames computed from audio are treated as images and attacked building upon approaches from image object recognition.

Instrument classification itself can be divided into the following tasks, with increasing degree of difficulty: isolated note instrument classification, solo instrument classification, and multi-label classification in polyphonic mixtures (Lostanlen et al., 2018). We use data that yields isolated note as well as solo instrument and singing voice audio, leading to an expanded version of isolated note classification, a task for which good performance has been reported previously (Bhalke et al., 2016). We chose solo instrument sound classification as a relatively simplistic test-bed because the monophonic nature of the audio makes added audio perturbations more easily perceptible and the task provides an unambiguous ground-truth. Listening examples of results in Section 3.5.3 corroborate this. As a classifier we chose a Convolutional Neural Network (CNN), which is now state-of-the-art for this kind of task (Lostanlen et al., 2018), with the specific architecture described in Section 3.3 having achieved good results previously (Paischer et al., 2019).

As our second contribution, we motivate adversarial attacks in MIR by attacking a real-world content-based music recommendation system (Gasser and Flexer, 2009) which recommends nearest neighbours of a query song according to an audio-based similarity function. Due to a problem of measuring distances in high dimensions, so-called hub songs are recommended over and over again while anti-hubs never appear in recommendation lists. As a result, only about two-thirds of the music catalogue is reachable in the recommendation interface and over a third of the songs are never recommended (Flexer and Stevens, 2018). We chose this system as its mode of operation is well documented, the data is freely available and the issues with hub songs have been reported previously. In addition to that, users can contribute music to the system, which is needed for the setting of our attack.

The term Hubness was first described in MIR (Pachet and Aucouturier, 2004), and is now acknowledged as a general machine learning problem and a new aspect of the curse of dimensionality (Radovanović et al., 2010; Schnitzer et al., 2012; Feldbauer and Flexer, 2019). Additionally, hubness is also a relevant problem in recommendation systems using collaborative filtering (Knees et al., 2014), a method which is very common in recommendation systems (Schedl, 2019). Only recently, adversarial attacks on multimedia recommendation systems have gained attention (Deldjoo et al., 2021), with a particular focus on systems based on collaborative filtering and multimedia data other than music.

Our experiments show that we can significantly degrade the performance of an instrument classifier, as well as severely mislead a real-world music recommender and artificially boost playcounts by computing adversarial perturbations directly on raw waveforms.

## 3 Attacking an Instrument Classifier

### 3.1 Data

The data we use for instrument classification is a subset of the curated training set provided for Task 2 in the DCASE2019 Challenge (Fonseca et al., 2019). Out of originally 4,970 samples with diverse audio labels, we use all 799 mono audio files annotated with one out of 12 different musical labels: Accordion, Acoustic Guitar, Bass Drum, Bass Guitar, Electric Guitar, Female Singing, Glockenspiel, Gong, Harmonica, Hi-Hat, Male Singing, and Marimba/xylophone. The labels are uniformly distributed (75 occurrences each, with two exceptions — Accordion: 47, Glockenspiel: 56). In what follows, we use the term instrument to describe both instrumental and singing voice sounds. The labels are weak, i.e., only available at clip-level without starting and ending time information.

Audio samples consist of isolated notes or single instrument sounds as well as singing voice audio, with very few of the samples having potentially incomplete labels. In other words, the labelled instrument is always present in the audio, but there might be additional instrument sounds not represented in the respective annotation (Fonseca et al., 2019). The length of different audio samples varies from 0.3 to 29.29 seconds (with a median of 3.3 seconds).

### 3.2 Data Preprocessing

To prepare the data for instrument classification, all audio files are first resampled from 44.1 kHz to 16 kHz. In contrast to previous work on adversaries in MIR (Kereliuk et al., 2015), we compute adversarial perturbations on raw waveforms directly. This allows us to avoid having an additional processing step that ensures valid time-domain signals, which is otherwise necessary (Kereliuk et al., 2015). To realise this, we use a differentiable preprocessing module1 before classification that transforms raw audio into the time-frequency domain, as is done similarly by Carlini and Wagner (2018).

Within the preprocessing module we compute spectrograms with a Fast Fourier Transform (FFT) size of 2,048, an equally sized Hann window and a hop size of 512. The spectrograms are mapped to mel-scale, with 100 mel-bands between 40 Hz and 8 kHz. The resulting mel-spectrograms are then transformed to decibel (dB) scale and normalised to have zero mean and a standard deviation of one over the different mel-bands, across all training data. In cases where spectrograms need to be lengthened, e.g., due to network requirements, we perform padding by repeating the spectrogram and inserting the repetitions with equal lengths both before and after the original signal.

### 3.3 Instrument Classification System

For instrument classification, we subsequently use a CNN, with its architecture depicted in Figure 1. The network consists of multiple convolutional layers with 5 × 5 and 3 × 3 convolutions and ReLU activations (Nair and Hinton, 2010), followed by batch normalisation (Ioffe and Szegedy, 2015) and average-pooling layers. Padding values correspond to the strides of convolutional layers. For regularisation, dropout is used. The final layers consist of 1 × 1 convolutions and a global pooling layer, which performs global average-pooling over the frequency dimension, and global max-pooling over the time dimension of an input.

Figure 1

CNN architecture of the instrument classifier. Input: channels@mel bands × windows.

For training and evaluation, we split the data into training and validation sets containing 599 and 200 audio files respectively. To train the network, we use the cross-entropy loss and the Adam optimiser (Kingma and Ba, 2015) to minimise the loss on the training data. We use an initial learning rate of 0.001 for 200 epochs and a batch-size of 16. After 90 epochs we apply learning rate decay with a multiplicative factor of 0.1. Hyper-parameters are tuned on the validation set. Note that the accuracy on the validation data hence is a potential over-estimation; due to the low total number of available samples and our foremost interest in lowering the performance of the system as opposed to showing its generalisation ability, we refrain from using a separate test set. To rule out that attacking the biased classifier simplifies the task of finding adversarial examples, we report additional experiments with separate training, validation and test sets in the supplementary material Section A.5. These additional results are very similar to those reported in the following sections and support our decision of not using a separate test set. The evaluation of different adversarial attacks with various parameter settings is also performed on the validation set.

For batch-wise training, we use windowed spectrograms of length 116. For shorter spectrograms, we perform padding as described above; for longer spectrograms, we extract half-overlapping windows of length 116 and choose one arbitrary window in each iteration. When performing validation, on the other hand, we use spectrograms with their original lengths, or padded versions to fulfil minimum length requirements of the network.
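The window extraction for longer spectrograms can be sketched as follows. This is a minimal illustration, not the training code of this work: the function names are hypothetical, the spectrogram is assumed to be a NumPy array of shape (mel bands, frames), and frames that do not fill a complete final window are simply dropped, which is an assumption.

```python
import numpy as np

def half_overlapping_windows(spec, win_len=116):
    """Split a (mel_bands, frames) spectrogram into windows with 50% overlap."""
    n_frames = spec.shape[1]
    if n_frames < win_len:
        raise ValueError("spectrogram shorter than one window; pad it first")
    hop = win_len // 2  # half-overlapping windows
    # Assumption: trailing frames that do not fill a whole window are dropped.
    starts = range(0, n_frames - win_len + 1, hop)
    return [spec[:, s:s + win_len] for s in starts]

def sample_training_window(spec, rng, win_len=116):
    """Pick one arbitrary window, as done once per training iteration."""
    windows = half_overlapping_windows(spec, win_len)
    return windows[rng.integers(len(windows))]

# Toy usage: 100 mel bands, 300 frames of random values.
rng = np.random.default_rng(0)
toy_spec = rng.standard_normal((100, 300))
window = sample_training_window(toy_spec, rng)
```

For a spectrogram of 300 frames this yields four candidate windows starting at frames 0, 58, 116 and 174.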

### 3.4 Adversarial Attacks

We subsequently describe the four attacks we apply in this work (Section 2) based on how they compute adversarial perturbations. Two of the attacks, namely the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) on the negative loss function, are untargeted, i.e., they aim to change the output of a system arbitrarily. The remaining two, which are both adaptations of the Carlini and Wagner (C&W) method proposed originally for a speech-to-text system (Carlini and Wagner, 2018), are targeted attacks. All four attacks assume full knowledge of a system, i.e., they are white-box attacks. The notation is summarised in Table 1. Note that gradient updates are performed based on the sign of a gradient, and iterative methods clip a perturbation δ after each iteration to stay in the range of [–ε, ε].

Table 1

Notation of variables.

| Variable | Meaning |
|---|---|
| $x$ | Original signal |
| $y$ | Ground-truth label |
| $X$ | Time-frequency representation of $x$ |
| $\hat{x}=x+\delta$ | Adversarial example |
| $t$ | Target class/prediction |
| $f$ | System (e.g., instrument classifier) |
| $L_{sys}$ | System-specific loss function (e.g., cross-entropy loss) |
| $ep$ | Current iteration |
| $\delta_{ep}$ | Perturbation during iteration $ep$ |
| $\alpha$ | Weight factor for adversarial objective |

FGSM: This method is one of the first adversarial attacks introduced in the literature (Goodfellow et al., 2015), and a single-step method. The attack attempts to change predictions of a system by finding a perturbation that increases the loss Lsys. To realise this, a scaling factor λ is used to perform a scaled step in the gradient direction of the loss function, i.e.,

(1)
$\delta = \lambda \cdot \mathrm{sign}\left(\nabla_{x} L_{sys}\left(f(x), y\right)\right).$

The scaling factor λ controls the ease of finding an adversarial perturbation (larger λ) versus the imperceptibility thereof (smaller λ).
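To make Eq. (1) concrete, the following sketch applies a single FGSM step to a toy model. The linear softmax classifier, its random weights `W` and the 64-sample input are illustrative assumptions chosen so that the gradient can be written analytically; they stand in for the CNN and the waveforms used in this work.

```python
import numpy as np

def cross_entropy(W, x, y):
    """Cross-entropy loss of the toy classifier softmax(W @ x) for label y."""
    z = W @ x
    return np.log(np.sum(np.exp(z - z.max()))) + z.max() - z[y]

def cross_entropy_grad_wrt_input(W, x, y):
    """Analytic gradient of the loss w.r.t. the input: W^T (softmax(Wx) - onehot(y))."""
    z = W @ x
    p = np.exp(z - z.max())
    p /= p.sum()
    p[y] -= 1.0
    return W.T @ p

def fgsm(W, x, y, lam):
    """Single-step perturbation delta = lam * sign(grad_x L_sys), cf. Eq. (1)."""
    return lam * np.sign(cross_entropy_grad_wrt_input(W, x, y))

# Toy setting (assumed): 12 classes, 64-sample "waveform".
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((12, 64))
x = rng.standard_normal(64)
y = 3
delta = fgsm(W, x, y, lam=0.01)
```

Every element of `delta` has magnitude λ (or zero where the gradient vanishes), which is why λ directly trades off attack strength against perceptibility.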

PGD-Attack: The second attack, which is PGD on the negative loss function (Madry et al., 2018), extends FGSM to an iterative approach, where perturbations are computed by performing multiple steps in the gradient direction. The size of these steps is determined by the factor η, leading to

(2)
$\delta_{ep+1} = \mathrm{clip}_{\varepsilon}\left(\delta_{ep} + \eta \cdot \mathrm{sign}\left(\nabla_{\delta_{ep}} L_{sys}\left(f(x+\delta_{ep}), y\right)\right)\right).$
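The iteration in Eq. (2) can be sketched as below. The binary logistic model, its weights `w` and the fixed number of iterations are assumptions made for illustration; an actual attack would stop as soon as the prediction changes. The uniform initialisation of the perturbation follows the description in Section 3.5.2.

```python
import numpy as np

def bce_loss(w, x, y):
    """Binary cross-entropy of sigmoid(w @ x) for a label y in {0, 1}."""
    z = w @ x
    return np.logaddexp(0.0, z) - y * z

def bce_grad_wrt_input(w, x, y):
    """Gradient of bce_loss w.r.t. the input x: (sigmoid(z) - y) * w."""
    z = w @ x
    return (1.0 / (1.0 + np.exp(-z)) - y) * w

def pgd(grad_fn, x, y, eps, eta, n_iter, rng):
    """Repeat delta <- clip_eps(delta + eta * sign(grad L_sys)), cf. Eq. (2)."""
    delta = rng.uniform(-eps, eps, size=x.shape)  # random start inside the eps-ball
    for _ in range(n_iter):
        step = eta * np.sign(grad_fn(x + delta, y))
        delta = np.clip(delta + step, -eps, eps)
    return delta

# Toy setting (assumed): 64-sample input, label 1.
rng = np.random.default_rng(0)
w = 0.1 * rng.standard_normal(64)
x = rng.standard_normal(64)
y = 1
eps = 0.01
delta = pgd(lambda v, label: bce_grad_wrt_input(w, v, label),
            x, y, eps=eps, eta=0.002, n_iter=20, rng=rng)
```

Because the toy model is linear, the gradient sign never changes and the perturbation ends up on the boundary of the ε-ball; for a CNN the sign pattern generally changes between iterations.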

C&W: C&W aims to find adversarial perturbations by decreasing the system loss Lsys with respect to a new target prediction t. To control the magnitude of the distortion, the method minimises a weighted combination of squared L2-norm of the perturbation and the speech-to-text specific CTC-loss function as their system loss (Carlini and Wagner, 2018), i.e.,

(3)
$\begin{array}{l} L_{total1} = \|\delta_{ep}\|_2^2 + \alpha \cdot L_{sys}\left(f(x+\delta_{ep}), t\right), \\ \delta_{ep+1} = \mathrm{clip}_{\varepsilon}\left(\delta_{ep} - \eta \cdot \mathrm{sign}\left(\nabla_{\delta_{ep}} L_{total1}\right)\right). \end{array}$

Carlini and Wagner (2018) additionally use a rescaling factor to refine initial adversarial perturbations in order to stay below signal-to-noise ratio (SNR) thresholds, which we omit in this work and instead compare the attacks based on the initial adversaries that are found.
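As a small worked step, and assuming that $L_{sys}$ is differentiable with respect to the perturbation, the gradient used in the update of Eq. (3) splits into a distortion term and an adversarial term:

$\nabla_{\delta_{ep}} L_{total1} = 2\,\delta_{ep} + \alpha \cdot \nabla_{\delta_{ep}} L_{sys}\left(f(x+\delta_{ep}), t\right).$

With a zero initialisation $\delta_0 = 0$, the distortion term vanishes in the first iteration, so the first step is driven by the adversarial term alone. In later iterations the weight α decides, element by element, which of the two terms determines the sign of the update.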

Multi-Scale C&W: For the fourth attack, we adapt the C&W method and exchange the squared L2-norm with the audio-specific multi-scale loss (Engel et al., 2020), i.e.,

(4)
$\begin{array}{ll} L_{spec} & = \sum_{i} \|X_i - \hat{X}_i\|_1 + \|\log X_i - \log \hat{X}_i\|_1, \\ L_{total2} & = L_{spec} + \alpha \cdot L_{sys}\left(f(x+\delta_{ep}), t\right), \\ \delta_{ep+1} & = \mathrm{clip}_{\varepsilon}\left(\delta_{ep} - \eta \cdot \mathrm{sign}\left(\nabla_{\delta_{ep}} L_{total2}\right)\right). \end{array}$

The index i is used to iterate over different spectrograms Xi, which are the result of applying Short-Time Fourier Transforms (STFTs) with different FFT sizes, and hop sizes corresponding to a window-overlap of 75%, as proposed by Engel et al. (2020).
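A minimal NumPy sketch of $L_{spec}$ in Eq. (4) is given below. Several details are assumptions rather than settings taken from this work: the list of FFT sizes is a placeholder, both norms are taken to be inside the sum over scales, a small constant `log_eps` is added before the logarithm to avoid log(0), and the STFT uses a symmetric Hann window without padding.

```python
import numpy as np

def stft_magnitude(x, n_fft):
    """Magnitude STFT with a Hann window and 75% overlap (hop = n_fft // 4)."""
    hop = n_fft // 4
    n_frames = 1 + (len(x) - n_fft) // hop
    # Index matrix of shape (n_frames, n_fft) selecting each frame of x.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(n_fft)
    return np.abs(np.fft.rfft(frames, axis=1))

def multi_scale_spectral_loss(x, x_adv,
                              fft_sizes=(2048, 1024, 512, 256, 128, 64),
                              log_eps=1e-7):
    """Sum of L1 distances between linear and log magnitudes over all scales."""
    loss = 0.0
    for n_fft in fft_sizes:  # placeholder sizes, not taken from this work
        X = stft_magnitude(x, n_fft)
        X_adv = stft_magnitude(x_adv, n_fft)
        loss += np.abs(X - X_adv).sum()
        loss += np.abs(np.log(X + log_eps) - np.log(X_adv + log_eps)).sum()
    return loss

# Toy usage: one second of noise at 16 kHz plus a small perturbation.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
x_adv = x + 1e-3 * rng.standard_normal(16000)
l_spec = multi_scale_spectral_loss(x, x_adv)
```

In an attack this value would be computed inside an automatic differentiation framework so that its gradient with respect to the perturbation is available; the NumPy version only illustrates the forward computation.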

### 3.5 Experiments

#### 3.5.1 Threat Model and Evaluation

In this work, we attempt to find waveforms $\hat{x}=x+\delta$, such that x and $\hat{x}$ sound similar but $f(x)\ne f(\hat{x})$, where f is some classifier function. As has been done in related work (Carlini and Wagner, 2018), we assume a white-box scenario, in which we know about the model f and its parameters. An extension to black-box attacks, e.g. by transferring adversarial examples from other architectures, is discussed in Section 5.

To measure the quality of adversarial examples, i.e., quantify how similar x and $\hat{x}$ sound, we use the SNR, as has been done previously in the audio field (Kereliuk et al., 2015; Carlini and Wagner, 2018). The SNR is computed as the ratio between original signal x and added perturbation δ. Furthermore, we use the number of samples on which the attack succeeds in changing the classifier’s decision as a measure of success.
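One common way to compute such an SNR in dB is sketched below. The power-ratio form with a factor of 10 is an assumption, since the exact formula is not spelled out here, and the function name is hypothetical.

```python
import numpy as np

def snr_db(x, delta):
    """SNR in dB between a signal x and an added perturbation delta."""
    signal_power = np.sum(np.square(x))
    noise_power = np.sum(np.square(delta))
    return 10.0 * np.log10(signal_power / noise_power)

# Toy usage: a perturbation with one tenth of the signal's amplitude.
x = np.ones(100)
delta = 0.1 * np.ones(100)
snr = snr_db(x, delta)
```

Under this definition a perturbation with one tenth of the signal amplitude gives 20 dB, and the SNR becomes negative once the perturbation carries more energy than the signal itself.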

We additionally list the accuracy of the system after an attack w.r.t. the ground-truth and the median number of iterations required to find an adversarial example.

Whenever we refer to differences being statistically significant in the remainder of this section, we mean significant as tested via paired t-tests at a 5% error level; for more detailed results, including exact t-test results and parameters for all different methods, we refer to Section A.1 in the supplementary material.2

#### 3.5.2 Implementation Details

In what follows, we use stochastic gradient ascent to perform the single gradient update for FGSM. For the remaining three of the four adversarial attacks, we use Adam to perform gradient descent and ascent, as appropriate. We apply a grid-search to find attack-specific parameters, and restrict each iterative method to find adversarial perturbations within 500 iterations. Furthermore, a target class different than the original prediction is randomly selected for each sample. These sampled targets are used in all parameter settings within the grid-search.

A perturbation δ0 is initialised with zeros for all attacks except the PGD-attack, for which we sample δ0 from a uniform distribution 𝒰(–ɛ, ɛ), where ɛ is the clipping factor introduced in Section 3.4. The system loss Lsys for the instrument classifier corresponds to the cross-entropy loss.

#### 3.5.3 Results

In Table 2, we compare the four adversarial attacks based on five different factors. To compute the accuracy of the system after an attack (fourth column), we use predictions on the perturbed signal in cases where the attack was successful in finding an adversarial example, and predictions on the unperturbed signal otherwise. The last three columns are computed solely on adversarial examples. As the initial perturbation for the PGD-Attack and the targets for C&W and Multi-Scale C&W are sampled randomly, we repeat these experiments five times and report mean and standard deviation over all runs. Lines 4 to 7 in Table 2 are based on the parameter settings in our grid-search that achieve the highest average SNR with at least 150 found adversarial examples out of 200 samples. In lines 8 to 11, we show how these results change when requiring at least 180 found adversarial examples. Line 3 contains the accuracy of the system after we perturb the original audio with random white-noise instead of adversarial noise, averaged over 5 runs. Here we use an SNR comparable to that achieved by C&W and Multi-Scale C&W.
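The accuracy computation described above can be sketched as follows; the function and argument names are hypothetical, and the labels are toy values.

```python
import numpy as np

def accuracy_after_attack(y_true, pred_clean, pred_adv, success):
    """Use the adversarial prediction where the attack succeeded, else the clean one."""
    final_pred = np.where(success, pred_adv, pred_clean)
    return float(np.mean(final_pred == y_true))

# Toy usage with four samples; the attack fails on the second sample.
y_true = np.array([0, 1, 2, 3])
pred_clean = np.array([0, 1, 2, 0])    # last sample is misclassified on clean audio
pred_adv = np.array([1, 1, 0, 3])      # entry for the failed attack is ignored
success = np.array([True, False, True, True])
acc = accuracy_after_attack(y_true, pred_clean, pred_adv, success)
```

In this toy case the accuracy drops from 0.75 to 0.5, and the last sample shows that an adversarial example can also correct a previously misclassified sample.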

Table 2

Comparison of the adversarial attacks on our instrument classifier. Results are chosen based on largest SNR with at least 150 (lines 4 to 7) and 180 (lines 8 to 11) successfully found adversarial examples out of 200. Depicted are averages or the median over samples; for the PGD-Attack, C&W and Multi-Scale C&W additionally average and standard deviation* of results over five runs are stated. Line 3 contains a baseline with random white-noise instead of adversarial perturbations.

| Samples Required | Data Origin | # Samples | Accuracy | SNR (dB) | Iterations |
|---|---|---|---|---|---|
| | Clean | 200 | 0.835 | | |
| | White-noise | 200 | 0.785 ± 0.000* | 42.71 ± 0.00* | |
| min. 150 | FGSM | 153 | 0.250 | –7.74 | 1.0 |
| | PGD-Attack | 151.8 ± 0.7* | 0.171 ± 0.004* | 40.13 ± 0.05* | 15.8 ± 0.4* |
| | C&W | 153.2 ± 2.6* | 0.201 ± 0.016* | 44.23 ± 0.37* | 51.4 ± 2.7* |
| | C&W$_{multi\_scale}$ | 163.6 ± 3.0* | 0.167 ± 0.012* | 43.82 ± 0.09* | 71.6 ± 5.4* |
| min. 180 | FGSM | 179 | 0.130 | –24.83 | 1.0 |
| | PGD-Attack | 190.8 ± 1.2* | 0.026 ± 0.004* | 16.47 ± 0.10* | 2.0 ± 0.0* |
| | C&W | 180.2 ± 2.3* | 0.094 ± 0.010* | 42.98 ± 0.18* | 66.1 ± 3.7* |
| | C&W$_{multi\_scale}$ | 196.4 ± 1.0* | 0.024 ± 0.004* | 39.49 ± 0.17* | 22.6 ± 1.0* |

Number of Samples: The results for each of the four adversarial attacks are chosen such that either at least 150 or 180 adversarial examples are found for the 200 samples in our validation set. The overall lowest number is achieved by the PGD-Attack, with an average of 151.8. Whether we require at least 150 or at least 180 samples, Multi-Scale C&W is the method that finds the highest number of adversarial examples, with on average 163.6 and 196.4 samples respectively. The difference to the remaining methods is statistically significant. Note that for all of the four methods except FGSM it is possible to find at least 180 adversarial examples; for the single-step method, the highest number of samples that is found within our grid-search is 179.

Accuracy: The average accuracy of the instrument classifier on the validation data before any adversarial attack is 0.835. The average accuracy after an attack is strongly influenced by the number of adversarial examples an adversary finds. The more often an adversary is successful in finding adversarial perturbations, the lower the accuracy tends to go; exceptions are adversarial examples that correct the prediction for previously misclassified samples. After an adversarial attack, the reduced accuracy of our system is always close to a baseline accuracy of 0.125, which can be achieved by predicting only the most frequent class within the validation set (‘Gong’). More precisely, the average accuracy is between 0.250 and 0.167 when we require 150 adversarial examples found by an adversary; when requiring at least 180 examples, this drops further to values between 0.130 and 0.024. In both scenarios, FGSM achieves the highest accuracy with 0.250 and 0.130, followed by the targeted C&W method with averages of 0.201 and 0.094. The PGD-Attack leads to the second lowest accuracies with 0.171 and 0.026. Multi-Scale C&W results in the lowest average accuracy of 0.167 and 0.024, which is statistically significant compared to FGSM and C&W.

To put these results into perspective, we take a look at how the accuracy of our system changes if we add random white-noise instead of adversarial perturbations to validation samples (see line 3 in Table 2, cf. Szegedy et al. (2014)). We overlay original signals with white-noise of similar average SNR as the two targeted attacks (>40 dB), and repeat this experiment again for five different random seeds. With an average accuracy of 0.785 ± 0.000 over the five runs, we always remain relatively close to the accuracy of 0.835 on clean data.

In Figure 2, three confusion matrices illustrate the accuracy of our system after different attacks. The columns are ground-truth labels, and the rows represent predictions; correct predictions are in the diagonal, and confusions off-diagonal. Each column is normalised to sum to 1, such that each column shows the proportion of (in-)correct classifications for one ground-truth class. For detailed confusion matrices including numerical values, we refer to Section A.2 in the supplementary material. The leftmost image shows the performance on clean data. The majority of predictions are concentrated in the diagonal of the matrix; the classes which are least often classified correctly by our system are ‘Marimba/xylophone’ and ‘Electric guitar’. The remaining two matrices are chosen to represent an untargeted as well as a targeted attack leading to a broad variety of new classifications. The confusion matrix in the middle shows the confusions of our system after a PGD-Attack. The attack leads to diverse new predictions, and appears to only have difficulties finding adversarial examples for samples with the ground-truth class ‘Male singing’, resulting in the only noticeable diagonal entry in Figure 2b. The rightmost confusion matrix in Figure 2 depicts confusions after a Multi-Scale C&W attack. The new predictions are diverse, and values in the diagonal of the matrix are close or equal to 0.

Figure 2

Confusion matrices computed on validation data, showing correct predictions in the diagonal, confusions off-diagonal. For samples without adversarial counterpart, original audio is used. Columns are ground-truth labels and rows predictions; columns are normalised to sum to 1. Order of labels (left to right and top to bottom): Accordion, Acoustic guitar, Bass drum, Bass guitar, Electric guitar, Female singing, Glockenspiel, Gong, Harmonica, Hi-hat, Male singing, and Marimba/xylophone.

When increasing the number of required adversarial examples, the average SNRs for both C&W and Multi-Scale C&W decrease slightly to 42.98 dB and 39.49 dB respectively; the average SNR of the PGD-Attack, however, drops significantly to 16.47 dB. This leads to perceptible adversarial perturbations of the PGD-Attack, as the listening examples suggest, although the original class remains recognisable. The difference of the SNR of C&W compared to the remaining three methods is statistically significant.

Iterations: As the number of iterations it takes for an adversary to find a perturbation on a particular sample strongly varies, we compute the median number of iterations per run. Whenever multiple runs are performed for a particular attack, we state the average ± the standard deviation of the median number of iterations. By definition, the single-step FGSM is the method requiring the fewest iterations. The second fewest iterations are needed by the PGD-Attack, taking 15.8 and 2.0 iterations on average, for at least 150 and 180 successful attacks respectively. For the targeted methods, the number of iterations needed depends on specific parameter settings; in the first case, C&W requires 51.4 iterations as opposed to 71.6 for Multi-Scale C&W. In the second scenario however, Multi-Scale C&W takes fewer iterations on average with 22.6 iterations in contrast to 66.1 by C&W.

Note that the algorithms tend to find successful adversarial examples quicker in terms of number of iterations when requiring more adversarial examples (180). We assume that this is because the algorithm then prioritises finding perturbations that change the prediction of a sample over trying to keep the perturbation as small as possible, therefore potentially allowing larger changes to the perturbation in every update. The only exception here is C&W, which often finds perturbations in the first scenario (i.e., 150 samples) within a similar number of iterations as in the second scenario (180 samples), or not within the maximal number of iterations (500) at all.

In addition to looking at the number of iterations that different attacks take for various parameter settings, we compare runtime complexities of single iterations with respect to the waveform length N. The gradient $\nabla_{\delta} L_{sys}(f(x+\delta), y)$ is computed in each of the four methods, which is why we consider it to be constant and focus on the complexity of the remaining computations. For FGSM, the PGD-Attack and C&W, all computations are elementwise, and therefore $O(N)$. Multi-Scale C&W on the other hand is dominated by the STFT in the computation of $L_{spec}$ in Eqn. (4), and is hence $O(N \log L)$, where L is the maximal window size of the STFTs.

#### 3.5.4 Using Targeted Attacks to Predict Accordion

Instead of randomly choosing a target class when using the two targeted attacks, we can also try to transform any class into a particular target class. To demonstrate this, we use the ‘Accordion’ class, as it is one of the hardest target classes to reach in prior experiments. Out of the 200 samples within the validation set, 12 have ‘Accordion’ as the initial prediction of the system already, and are therefore skipped during the attack. For C&W, we find adversarial perturbations on all but four of the remaining 188 samples; the four failures have the ground-truth classes ‘Gong’ and ‘Male singing’. The adversarial examples have an average SNR of 37.41 dB ± 10.72. As a median, the adversary takes 29 iterations to find a successful adversarial perturbation.

For Multi-Scale C&W, the adversary is successful in finding adversarial examples for all but seven samples (out of 188). The adversarial perturbations lead to an average SNR of 37.57 dB ± 10.82. The median of required iterations is 33. For more information on the parameters we use for the two methods as well as listening examples, we refer to sections A.2.6 and A.3.4 in the supplementary material.

## 4 Attacking a Music Recommender

### 4.1 Data

The data used for the music recommendation system consists of 15,750 songs taken from the music discovery system FM4 soundpark.3 The music itself is divided into six different genres, namely Electronica, Funk, Hip-Hop, Pop, Reggae and Rock (Gasser and Flexer, 2009). Due to length requirements in the preprocessing step (see Section 4.2), all songs we use have a duration of at least two minutes; the median duration of a song is ~4 minutes, and the maximal duration 33 minutes.

### 4.2 Data Preprocessing

The data in Section 4.1 is preprocessed as in the existing real-world system (Gasser and Flexer, 2009). First, the audio is converted to mono and resampled to 22.05 kHz. We then compute Mel Frequency Cepstrum Coefficients (MFCCs) of the central two minutes of each song. To do this, we apply an STFT with a window size of 1,024 samples, a hop size of 512 samples, and a Hann window. The resulting spectrogram is subsequently transformed to mel-scale, converted to decibels and finally compressed to 20 MFCCs by applying a discrete cosine transform.

### 4.3 Music Recommendation System

To perform content-based music recommendation, Gasser and Flexer (2009) use the spectral similarity of different songs. After obtaining the MFCCs for each song, a single Gaussian is learned to represent a particular song (Mandel and Ellis, 2005). Then, the Kullback-Leibler (KL) divergence is computed between the Gaussian of any two songs. The recommendations for a song are obtained by finding its k-nearest neighbours by means of the smallest KL divergences; Gasser and Flexer (2009) use k = 5.
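The similarity computation can be sketched with the closed-form KL divergence between two multivariate Gaussians. This is an illustration under assumptions, not the Soundpark implementation: full covariance matrices are used, the divergence is taken in the direction from query to candidate without symmetrisation, and all function names are hypothetical.

```python
import numpy as np

def fit_gaussian(mfccs):
    """mfccs: (frames, n_coeffs) array -> mean vector and full covariance matrix."""
    return mfccs.mean(axis=0), np.cov(mfccs, rowvar=False)

def kl_gaussians(mu0, cov0, mu1, cov1):
    """Closed-form KL( N(mu0, cov0) || N(mu1, cov1) )."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff
                  - d + logdet1 - logdet0)

def recommend(query, gaussians, k=5):
    """Indices of the k songs with the smallest KL divergence to the query song."""
    divs = np.array([np.inf if i == query else kl_gaussians(*gaussians[query], *g)
                     for i, g in enumerate(gaussians)])
    return np.argsort(divs)[:k]

# Toy usage: eight "songs" with 4 random coefficients per frame and shifted means.
rng = np.random.default_rng(0)
songs = [fit_gaussian(rng.standard_normal((200, 4)) + i) for i in range(8)]
recommendations = recommend(0, songs, k=5)
```

Because the KL divergence is not symmetric, the direction chosen here (or a symmetrised variant) changes the neighbourhood structure and should be matched to the system under study.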

### 4.4 Hubness

As previously stated (Section 2), the issue of hubness is also present in the music recommendation system described in Section 4.3. It results in unfair recommendations, where hub songs are often close neighbours of songs and therefore frequently recommended, while ‘anti-hubs’ do not occur within the k-nearest neighbours and hence are never recommended. An important measure to characterise hubness is the k-occurrence Ok of a song, which counts the number of times the song occurs within the k nearest neighbours of all remaining songs. Note that the average of all k-occurrences Ok in a data set is always exactly k. A hub is a song with unusually high k-occurrence. We refer to the threshold for the k-occurrence that separates hubs from non-hubs as the hub-size, e.g., with a hub-size of 5k, a hub must have Ok > 5k (five times the average k-occurrence). Anti-hub songs have a k-occurrence Ok = 0, and songs where 0 < Ok ≤ 5k are considered normal (Flexer and Stevens, 2018). Typical values of k used in the literature range from 1–20, with previous research indicating results with different values of k being highly correlated (Feldbauer and Flexer, 2019). As in previous analysis of the music recommender described in Section 4.3, we use k = 5 for our hubness analysis.4
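The k-occurrence and the resulting categories can be sketched as follows. The pairwise divergence matrix is random toy data and the function names are hypothetical; the thresholds follow the definitions above with a hub-size of 5k.

```python
import numpy as np

def k_occurrence(dist, k=5):
    """dist[i, j]: divergence from query song i to song j; returns O_k per song."""
    n = dist.shape[0]
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)                 # a song is never its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest neighbours of each query
    return np.bincount(neighbours.ravel(), minlength=n)

def categorise(o_k, k=5, hub_factor=5):
    """Label songs as 'hub' (O_k > 5k), 'anti-hub' (O_k = 0) or 'normal'."""
    hub_size = hub_factor * k
    return np.where(o_k > hub_size, "hub",
                    np.where(o_k == 0, "anti-hub", "normal"))

# Toy usage: 50 songs with random pairwise divergences.
rng = np.random.default_rng(0)
toy_dist = rng.random((50, 50))
o_k = k_occurrence(toy_dist, k=5)
labels = categorise(o_k, k=5)
```

Each of the n songs fills exactly k neighbour slots, so the k-occurrences sum to n·k and their mean is exactly k, as noted above.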

### 4.5 Experiments

#### 4.5.1 Threat Model and Evaluation

To attack the music recommendation system we once more use the C&W attack (Section 3.4), as it achieves an SNR higher than or similar to that of Multi-Scale C&W in previous experiments but with a lower runtime complexity. This system requires us to rephrase the threat model slightly; in order to present a relevant attack to the recommender, our goal here is to distort its recommendations by increasing the number of times a particular song is recommended. In other words, we look for adversarial examples $\hat{x}$, for which the original signal x and $\hat{x}$ sound indistinguishable but $\hat{x}$ is recommended more often than x, i.e., has a higher k-occurrence. The SNR once more acts as a proxy for how similar x and $\hat{x}$ sound.

We again assume a white-box scenario for this attack. Whereas the attack on the instrument classifier required knowledge of the model parameters, here we assume knowledge of the recommendation system and its underlying data. Additionally, we assume that the system allows user-generated contributions of data, as is the case for the Soundpark system. As these are considerable requirements, we discuss a possible relaxation thereof in Section 5.

To realise the attack, we exploit the fact that hub songs are by definition close to a large number of songs in the dataset in terms of their KL divergence, and try to push less recommended songs closer to these hub songs. For a song we want to perturb, we therefore first find a hub song t as the target for the C&W attack. Thereafter we try to push the current song towards the target by minimising the KL divergence of their respective Gaussians, leading to a modified objective (cf. Eqn. 3) of

(5)
$L_{total1} = \|\delta_k\|_2^2 + \alpha * D_{KL}\left(\mathcal{G}_{x+\delta_k}, \mathcal{G}_t\right).$

Here $D_{KL}$ is the KL divergence, and $\mathcal{G}_x$ denotes the Gaussian of the MFCCs of a signal x, as computed in Section 4.2. The remaining variables correspond to the notation in Table 1.
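To make the objective in Eqn. 5 concrete, the following sketch evaluates it for given feature matrices. It rests on several assumptions: the standard closed-form KL divergence between two full-covariance Gaussians is used (the exact, possibly symmetrised, variant defined in Section 4.2 may differ), the MFCC extraction is omitted and replaced by random stand-in features, and the helper names `gaussian_kl`, `fit_gaussian` and `total_loss` are hypothetical. In an actual attack the features would have to be a differentiable function of x + δ so that the loss can be minimised by gradient descent.

```python
import numpy as np

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """Closed-form KL(N(mu_p, cov_p) || N(mu_q, cov_q)) for full-covariance Gaussians."""
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (np.trace(cov_q_inv @ cov_p) + diff @ cov_q_inv @ diff
                  - d + logdet_q - logdet_p)

def fit_gaussian(features):
    """Mean vector and covariance matrix of a (frames x coefficients) feature matrix."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def total_loss(delta, feats_adv, feats_target, alpha):
    """Squared L2 norm of the perturbation plus alpha times the KL term, as in Eqn. 5."""
    mu_a, cov_a = fit_gaussian(feats_adv)
    mu_t, cov_t = fit_gaussian(feats_target)
    return np.sum(delta ** 2) + alpha * gaussian_kl(mu_a, cov_a, mu_t, cov_t)

# Stand-in features (assumption): 500 frames with 4 coefficients each.
rng = np.random.default_rng(1)
feats_song = rng.normal(size=(500, 4))
feats_hub = rng.normal(loc=0.5, size=(500, 4))
delta = np.zeros(16000)                  # initial perturbation, set to zero
loss = total_loss(delta, feats_song, feats_hub, alpha=1.0)
```

With a zero perturbation the first term vanishes, so the loss reduces to the scaled divergence between the song and its target hub; the trade-off between the two terms is controlled by α.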

An attack is considered to be successful for a song as soon as its k-occurrence Ok > hub_size, i.e., the song is promoted to a hub. As in previous experiments (Section 3.5.1), we evaluate the adversarial attack on the music recommendation system based on the number of adversarial examples that are found and their perceptibility in terms of the average SNR. Additionally, we look at the k-occurrence of successful adversarial examples but also analyse the improvement of k-occurrences after unsuccessful attacks.

#### 4.5.2 Implementation Details

As in Section 3.5.2, we use Adam to perform gradient descent and restrict C&W to perform a maximum of 500 iterations. The initial perturbation δ0 is set to zero. The target t for each file is chosen to be the closest hub, i.e., the hub song with the smallest pairwise KL divergence to the current song. Preliminary experiments with targets being a random hub song or the largest hub based on k-occurrence did not provide any significant improvement over this approach. Experimental results are subsequently stated for one run, as the attack is deterministic.
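The target selection described above can be sketched as follows, assuming the divergences from the current song to all songs and the k-occurrences of all songs are already available as arrays. The function name `closest_hub` and the toy values are hypothetical.

```python
import numpy as np

def closest_hub(div_to_all, occ, hub_size):
    """Return the index of the hub with the smallest divergence to the current song."""
    hubs = np.flatnonzero(occ > hub_size)          # candidate targets: all hub songs
    if hubs.size == 0:
        raise ValueError("no hub songs for this hub-size")
    return int(hubs[np.argmin(div_to_all[hubs])])

# Toy values (assumption): four songs, two of which are hubs for a hub-size of 25.
div_to_all = np.array([0.9, 0.2, 0.5, 0.7])
occ = np.array([30, 3, 40, 0])
target = closest_hub(div_to_all, occ, hub_size=25)
```

Note that the song with the smallest divergence overall (index 1) is not chosen, since it is not a hub; only songs exceeding the hub-size are eligible targets.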

When checking for the convergence of an attack, we compute the k-occurrence of an adversarial example in order to test its hubness. This, in turn, requires us to recompute pairwise KL divergences between the adversarial example and the remaining clean songs. For subsequent experiments, we use two simplifications that allow us to speed up this procedure. First, we only perform the convergence check every 10 iterations; secondly, we introduce a filter-and-refine approach, in which we first use KL divergences to approximate k-occurrences, before the actual k-occurrence is computed. For more information concerning the attack-specific parameters and the filter-and-refine approach, we refer to Section B.1 in our supplementary material and the code linked in Section 7, respectively.
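The exact convergence check can be sketched as below. This is only an illustration of how the k-occurrence of an adversarial example follows from its divergences to the clean songs; it does not reproduce the filter-and-refine step, whose details are given in the supplementary material and the linked code. The function names are hypothetical, Euclidean distances on random points stand in for KL divergences, and ties between distances are ignored.

```python
import numpy as np

def kth_neighbour_thresholds(dist, attacked, k):
    """Distance from every clean song to its k-th nearest neighbour, leaving out the attacked song."""
    d = np.array(dist, dtype=float)
    np.fill_diagonal(d, np.inf)
    d[:, attacked] = np.inf              # the clean version of the attacked song is replaced
    return np.sort(d, axis=1)[:, k - 1]

def adversarial_k_occurrence(dist_to_adv, thresholds, attacked):
    """Count the clean songs that would list the adversarial example among their k nearest neighbours."""
    inside = dist_to_adv < thresholds    # closer than the current k-th neighbour
    inside[attacked] = False             # the attacked song itself is not a query
    return int(inside.sum())

# Toy data (assumption): 100 random points; song 7 is attacked with k = 5.
rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 10))
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
attacked, k = 7, 5

thresholds = kth_neighbour_thresholds(dist, attacked, k)   # computed once per attacked song
occ_unperturbed = adversarial_k_occurrence(dist[:, attacked], thresholds, attacked)
```

The thresholds depend only on the clean data, so each check requires just the n divergences between the clean songs and the current adversarial example. For an asymmetric divergence, `dist_to_adv[j]` must be measured in the same direction as the entries of row j of `dist`.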

#### 4.5.3 Results

Table 3 summarises the results of attacking the music recommendation system with C&W. We repeat the attack multiple times for different hub-sizes. These hub-sizes are used both for choosing potential target hub songs, as well as for determining whether an attack was successful for a particular sample. The different hub-sizes we use are listed in the first column of Table 3.

Table 3

Results of adversarial C&W attack on music recommendation system for varying hub-sizes. SNR and k-occurrence expressed by mean ± standard deviation over all adversarial examples, the number of which is indicated by the number in column 3.

| Hub-size | # Hubs (before) | # Hubs (after) | # Non-hubs (after) | SNR (dB) | k-occurrence |
|---|---|---|---|---|---|
| 25 | 644 (4.1%) | 6,381 (40.5%) | 8,725 (55.4%) | 39.12 ± 5.50 | 48.50 ± 31.42 |
| 50 | 203 (1.3%) | 4,313 (27.4%) | 11,234 (71.3%) | 38.82 ± 5.02 | 85.34 ± 43.77 |
| 75 | 83 (0.5%) | 3,080 (19.6%) | 12,587 (79.9%) | 38.83 ± 4.58 | 119.55 ± 56.05 |
| 100 | 32 (0.2%) | 2,357 (15.0%) | 13,361 (84.8%) | 38.69 ± 4.33 | 153.05 ± 64.89 |
| 125 | 14 (0.1%) | 2,244 (14.2%) | 13,492 (85.7%) | 38.46 ± 4.18 | 183.03 ± 71.89 |

Column 2 in Table 3 shows the number of songs that have an initial k-occurrence Ok > hub_size, i.e., the initial number of hub songs within the data. These songs are skipped during an attack, as they already fulfil the convergence requirement of being a hub. Columns 3 and 4 show the numbers of songs in the data for which the attack is successful/not successful in increasing their k-occurrence to be larger than hub_size. The sum of columns 2–4 is equal to the total number of files within the dataset (15,750).
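As a quick consistency check of the counts reported in Table 3, the following snippet verifies that columns 2–4 add up to the dataset size for every hub-size and recomputes the share of files promoted to hubs. The values are copied from the table; the variable names are arbitrary.

```python
# (hubs before, hubs after attack, non-hubs after attack) per hub-size, from Table 3
table3 = {
    25: (644, 6381, 8725),
    50: (203, 4313, 11234),
    75: (83, 3080, 12587),
    100: (32, 2357, 13361),
    125: (14, 2244, 13492),
}
total_files = 15750

promoted_share = {}
for hub_size, (before, promoted, not_promoted) in table3.items():
    # Every file is either an initial hub, a successful or an unsuccessful attack.
    assert before + promoted + not_promoted == total_files
    promoted_share[hub_size] = round(100 * promoted / total_files, 1)

print(promoted_share)
```

The recomputed shares are relative to all files in the dataset, including the initial hubs that are skipped during the attack, which matches the percentages given in the table.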

The last two columns in Table 3 contain the average ± standard deviation of the SNR and the k-occurrence computed over successful adversarial examples.

When comparing the five entries in Table 3, we first observe that larger hub-sizes lead to a smaller number of initial hubs (column 2), and also make it increasingly difficult to find successful adversarial perturbations (columns 3 and 4). For the smallest hub-size of 25 (first row), which is also used by Flexer et al. (2018), C&W successfully promotes 40.5% of all files to hub songs; for the largest hub-size of 125 (last row), on the other hand, it is successful for only 14.2% of the files. The average SNR for adversarial examples remains similar for the different hub-sizes. As in previous experiments, and as suggested by the listening examples in Section B.2 of our supplementary material, an SNR close to 40 dB describes perturbations which are often imperceptible, or perceptible as high-frequency noise during quieter passages of a song. The listening examples again represent both more and less perceptible examples. The last column in Table 3 shows that, due to its definition, the average k-occurrence of successful adversarial examples increases with increasing target hub-size.

After looking at the number of songs C&W successfully promotes to hub songs for different hub-sizes, we additionally take a look, for a hub-size of 25, at how the k-occurrence Ok changes for files which do not pass the convergence check. Figure 3 shows the differences between the k-occurrence after 500 iterations of C&W and before the attack. The figure on the left shows the changes in k-occurrence for normal songs with an initial k-occurrence 0 < Ok ≤ 25; the figure on the right shows the changes for initial anti-hubs with Ok = 0 (5,631 out of 15,750 files). The x-axis of both figures shows the difference between the k-occurrence after and before the attack; the y-axis counts the number of times a particular increase/decrease occurs among all files we attack. Note here that an increase of more than 25 (x-axis) automatically promotes a song to a hub song. In Figure 3, we indicate an increase of the k-occurrence with bars in the darker colour (changes > 0), and a decrease as well as unchanged k-occurrences with the lighter colour (changes ≤ 0). The attack manages to increase the k-occurrence for 92.92% of normal songs (Figure 3a) and for 91.88% of initial anti-hubs (Figure 3b).

Figure 3

Histogram of changes in k-occurrence before and after the C&W attack on the music recommendation system for a hub-size of 25. Changes larger than zero denote an increase of the k-occurrence after an attack.

## 5 Discussion

Our experiments suggest that waveform perturbations are a promising approach for adversarial attacks in MIR. All four adversarial attacks can significantly reduce the accuracy of our instrument classification system. The best method in terms of the highest average SNR is C&W, which also has an asymptotically lower runtime complexity than Multi-Scale C&W. Both targeted methods, however, find adversarial perturbations with higher SNRs than the untargeted methods FGSM and the PGD-attack.

Compared to Kereliuk et al. (2015), the closest study to ours published in MIR, our end-to-end application of adversarial attacks leads to perturbations with slightly higher average SNR and a much larger reduction in system accuracy. Note, however, that the tasks differ, with ours being instrument classification as opposed to genre recognition. As an advantage of the attacks presented in this work, we do not require any additional processing steps to ensure valid time-domain signals; this is particularly interesting if the time-domain signal is needed to determine the perceptibility of an adversarial perturbation.

Concerning the adversarial attack on the real-world music recommendation system, we were successful in severely distorting its recommendations. We managed to promote around 40% of all files to so-called hub songs, which are recommended very frequently, and to increase the number of times a song is recommended for over 90% of all files. This might also have ethical implications, since songs promoted to hub songs could gain an unfair share of revenue distributed via the attacked recommender while at the same time pushing other songs out of playlists and lowering their share. Ethical aspects of unfair treatment by biased recommendation algorithms have already been discussed in MIR in general (Holzapfel et al., 2018) and for music recommendation impacted by hubness in particular (Flexer et al., 2018).

In future work, we aim at extending this new approach by transferring attacks (Subramanian et al., 2020) to different recommendation systems thereby achieving black-box attacks, without full knowledge of these other recommenders. Another possibility to relax the white-box nature of our attack is to obtain estimates of frequently recommended songs by probing a recommender and reducing the distance to these estimated hub songs without precise knowledge of the underlying system. Future work will also investigate existing hubness-reduction methods (Feldbauer and Flexer, 2019) as potential methods to make a recommendation system robust against these kinds of attacks.

The existence of adversarial examples suggests that “being able to correctly label the test data does not imply that our models truly understand the tasks we have asked them to perform” (Goodfellow et al., 2015) and that systems with impressive, almost human-level performance might not use musical knowledge at all (Sturm, 2013, 2014). This directly brings us to the question of validity, i.e., whether our instrument classification experiment is actually measuring what we intended to measure (Trochim and Donnelly, 2001; Urbano et al., 2013). Our real intention is to produce algorithms correctly classifying instrument sounds in general, not just the sounds in our dataset. If small perturbations, clearly not changing the perceived instrument characteristics, have such a large impact on the classification system, it must rely on a confounding factor, exposing a problem of internal validity, with no causal relation between type of instrument represented in the audio and instrument label returned by the classifier. Two variables that potentially influence measurements are called ‘confounded’ if the experimental design cannot disentangle their effects. In our case the type of instrument represented in the audio and the small adversarial perturbations are two such variables. A general framework for characterising effects of confounding in MIR experiments through regulated test conditions and interventions in the experimental pipeline has already been proposed (Rodríguez-Algarra et al., 2019). This also brings about a problem of external validity, with observed results not generalising from the original sound material to slightly changed versions thereof.

As for the real-life music recommendation system, the high intrinsic dimensionality of the sample space leads hub songs to be artificially close to many others, thereby hampering recommendation. Hubness is now acknowledged as a general problem of high dimensional machine learning and has been explored both theoretically and empirically using a wide range of distance/similarity measures (Radovanović et al., 2010) including KL divergences (Schnitzer et al., 2012) on which our music recommender is based. Our adversarial attack can exploit this problem and promote many songs to hub songs thereby boosting their occurrence in nearest neighbour based recommendation lists. Hubness therefore presents a problem of internal validity, since the mathematical problem of measuring distances in high dimensional spaces acts as a confounding factor at least partly explaining which songs are being recommended and which not.

After exposing these validity and reliability problems in this paper, our future work will also aim at explaining more thoroughly what the confounding factors are and how MIR systems that are robust against adversarial attacks could be designed.

## 6 Conclusion

In this work, we first applied four white-box adversarial attacks to an instrument classification system. Going beyond previous attacks, we computed adversarial examples directly on the waveform in an end-to-end fashion instead of on the spectrogram. The four attacks were compared with respect to various factors such as runtime complexity, number of adversarial examples that could be found and perceptibility of the adversarial perturbations based on the SNR and listening examples. The attacks could decisively reduce the accuracy of the system with (almost) imperceptible perturbations added to raw waveforms. Additionally, we proposed a new and motivating application for adversarial attacks in MIR by attacking a real-world music recommendation system. We here computed adversarial perturbations that increase the number of times a particular song is recommended, challenging the integrity of the system and questioning its validity.

## 7 Reproducibility

For reproducibility reasons, the code to run the experiments within this paper is available via GitHub.5

Supplementary Material

This contains the supplementary material as described in the text. DOI: https://doi.org/10.5334/tismir.85.s1

## Notes

2 See supplementary_material.html in the supplementary material, or https://cpjku.github.io/adversaries_in_mir.

3 http://fm4.orf.at/soundpark, accessed 1st of June 2021. Since the Soundpark recommender interface is based on Adobe Flash Player, which has no longer been supported since the beginning of 2021, the recommendations are currently not available.

4 Hubness analysis is based on the scikit-hubness toolbox (Feldbauer et al., 2020).

## Acknowledgements

This work is supported by the Austrian National Science Foundation (FWF, P 31988).

## Competing Interests

Arthur Flexer is a member of the editorial board of the Transactions of the International Society for Music Information Retrieval. He was completely removed from all editorial processing. There are no other competing interests to declare.

## References

1. Akhtar, N., and Mian, A. S. (2018). Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6: 14410–14430. DOI: https://doi.org/10.1109/ACCESS.2018.2807385

2. Bhalke, D. G., Rao, C. B. R., and Bormane, D. S. (2016). Automatic musical instrument classification using fractional Fourier transform based MFCC features and counter propagation neural network. Journal of Intelligent Information Systems, 46(3): 425–446. DOI: https://doi.org/10.1007/s10844-015-0360-9

3. Carlini, N., and Wagner, D. A. (2018). Audio adversarial examples: Targeted attacks on speech-to-text. In Proc. of the IEEE Security and Privacy Workshops, pages 1–7. IEEE. DOI: https://doi.org/10.1109/SPW.2018.00009

4. Deldjoo, Y., Noia, T. D., and Merra, F. A. (2021). A survey on adversarial recommender systems: From attack/defense strategies to generative adversarial networks. ACM Computing Surveys, 54(2): 1–38. DOI: https://doi.org/10.1145/3439729

5. Du, T., Ji, S., Li, J., Gu, Q., Wang, T., and Beyah, R. (2020). SirenAttack: Generating adversarial audio for end-to-end acoustic systems. In Proc. of the 15th ACM Asia Conference on Computer and Communications Security, pages 357–369. ACM. DOI: https://doi.org/10.1145/3320269.3384733

6. Engel, J. H., Hantrakul, L., Gu, C., and Roberts, A. (2020). DDSP: Differentiable digital signal processing. In Proc. of the 8th International Conference on Learning Representations.

7. Feldbauer, R., and Flexer, A. (2019). A comprehensive empirical comparison of hubness reduction in high-dimensional spaces. Knowledge and Information Systems, 59(1): 137–166. DOI: https://doi.org/10.1007/s10115-018-1205-y

8. Feldbauer, R., Rattei, T., and Flexer, A. (2020). scikit-hubness: Hubness reduction and approximate neighbor search. Journal of Open Source Software, 5(45): 1957. DOI: https://doi.org/10.21105/joss.01957

9. Flexer, A., Dörfler, M., Schlüter, J., and Grill, T. (2018). Hubness as a case of technical algorithmic bias in music recommendation. In Proc. of the IEEE International Conference on Data Mining Workshops, pages 1062–1069. IEEE. DOI: https://doi.org/10.1109/ICDMW.2018.00154

10. Flexer, A., and Stevens, J. (2018). Mutual proximity graphs for improved reachability in music recommendation. Journal of New Music Research, 47(1): 17–28. DOI: https://doi.org/10.1080/09298215.2017.1354891

11. Fonseca, E., Plakal, M., Font, F., Ellis, D. P., and Serra, X. (2019). Audio tagging with noisy labels and minimal supervision. In Proc. of the Detection and Classification of Acoustic Scenes and Events Workshop, pages 69–73. New York University. DOI: https://doi.org/10.33682/w13e-5v06

12. Gasser, M., and Flexer, A. (2009). FM4 Soundpark: Audio-based music recommendation in everyday use. In Proc. of the 6th Sound and Music Computing Conference, pages 23–25.

13. Goodfellow, I. J., Shlens, J., and Szegedy, C. (2015). Explaining and harnessing adversarial examples. In Proc. of the 3rd International Conference on Learning Representations.

14. Holzapfel, A., Sturm, B. L., and Coeckelbergh, M. (2018). Ethical dimensions of music information retrieval technology. Transactions of the International Society for Music Information Retrieval, 1(1): 44–55. DOI: https://doi.org/10.5334/tismir.13

15. Ioffe, S., and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. of the 32nd International Conference on Machine Learning, pages 448–456.

16. Kereliuk, C., Sturm, B. L., and Larsen, J. (2015). Deep learning, audio adversaries, and music content analysis. In Proc. of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 1–5. IEEE. DOI: https://doi.org/10.1109/WASPAA.2015.7336950

17. Kingma, D. P., and Ba, J. (2015). Adam: A method for stochastic optimization. In Proc. of the 3rd International Conference on Learning Representations.

18. Knees, P., Schnitzer, D., and Flexer, A. (2014). Improving neighborhood-based collaborative filtering by reducing hubness. In Proc. of the 4th International Conference on Multimedia Retrieval, page 161. ACM. DOI: https://doi.org/10.1145/2578726.2578747

19. Lostanlen, V., Andén, J., and Lagrange, M. (2018). Extended playing techniques: The next milestone in musical instrument recognition. In Proc. of the 5th International Conference on Digital Libraries for Musicology, pages 1–10. ACM. DOI: https://doi.org/10.1145/3273024.3273036

20. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. In Proc. of the 6th International Conference on Learning Representations.

21. Mandel, M. I., and Ellis, D. P. (2005). Song-level features and support vector machines for music classification. In Proc. of the 6th International Conference on Music Information Retrieval, pages 594–599.

22. Nair, V., and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proc. of the 27th International Conference on Machine Learning, pages 807–814.

23. Pachet, F., and Aucouturier, J.-J. (2004). Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences, 1(1): 1–13.

24. Paischer, F., Prinz, K., and Widmer, G. (2019). Audio tagging with convolutional neural networks trained with noisy data. In Proc. of the Detection and Classification of Acoustic Scenes and Events Workshop.

25. Qin, Y., Carlini, N., Cottrell, G. W., Goodfellow, I. J., and Raffel, C. (2019). Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. In Proc. of the 36th International Conference on Machine Learning, pages 5231–5240.

26. Radovanović, M., Nanopoulos, A., and Ivanović, M. (2010). Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 11(86): 2487–2531.

27. Rodríguez-Algarra, F., Sturm, B. L., and Dixon, S. (2019). Characterising confounding effects in music classification experiments through interventions. Transactions of the International Society for Music Information Retrieval, 2(1): 52–66. DOI: https://doi.org/10.5334/tismir.24

28. Schedl, M. (2019). Deep learning in music recommendation systems. Frontiers in Applied Mathematics and Statistics, 5: 44. DOI: https://doi.org/10.3389/fams.2019.00044

29. Schnitzer, D., Flexer, A., Schedl, M., and Widmer, G. (2012). Local and global scaling reduce hubs in space. Journal of Machine Learning Research, 13(1): 2871–2902.

30. Schönherr, L., Kohls, K., Zeiler, S., Holz, T., and Kolossa, D. (2019). Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. In Proc. of the 26th Annual Network and Distributed System Security Symposium. The Internet Society. DOI: https://doi.org/10.14722/ndss.2019.23288

31. Sturm, B. L. (2013). Classification accuracy is not enough: On the evaluation of music genre recognition systems. Journal of Intelligent Information Systems, 41(3): 371–406. DOI: https://doi.org/10.1007/s10844-013-0250-y

32. Sturm, B. L. (2014). A simple method to determine if a music information retrieval system is a “horse”. IEEE Transactions on Multimedia, 16(6): 1636–1644. DOI: https://doi.org/10.1109/TMM.2014.2330697

33. Sturm, B. L. (2016). The “horse” inside: Seeking causes behind the behaviors of music content analysis systems. Computers in Entertainment, 14(2): 1–32. DOI: https://doi.org/10.1145/2967507

34. Subramanian, V., Pankajakshan, A., Benetos, E., Xu, N., McDonald, S., and Sandler, M. (2020). A study on the transferability of adversarial attacks in sound event classification. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 301–305. IEEE. DOI: https://doi.org/10.1109/ICASSP40776.2020.9054445

35. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. (2014). Intriguing properties of neural networks. In Proc. of the 2nd International Conference on Learning Representations.

36. Trochim, W. M., and Donnelly, J. P. (2001). The Research Methods Knowledge Base. Atomic Dog Publishing, Cincinnati, 2nd edition.

37. Urbano, J., Schedl, M., and Serra, X. (2013). Evaluation in music information retrieval. Journal of Intelligent Information Systems, 41(3): 345–369. DOI: https://doi.org/10.1007/s10844-013-0249-4

38. Zhang, W. E., Sheng, Q. Z., Alhazmi, A. A., and Li, C. (2020). Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology, 11(3): 24:1–41. DOI: https://doi.org/10.1145/3374217