User Models for Culture-Aware Music Recommendation: Fusing Acoustic and Cultural Cues

Integrating information about the listener’s cultural background when building music recommender systems has recently been identified as a means to improve recommendation quality. In this article, we therefore propose a novel approach to jointly model users by their musical preferences and cultural backgrounds. We describe the musical preferences of users by the acoustic features of the songs the users have listened to and characterize the cultural background of users by culture-related socio-economic features that we infer from the user’s country. To evaluate the impact of the proposed user model on recommendation quality, we integrate the model into a culture-aware recommender system. By analyzing a dataset comprising approximately 400 million listening events of about 55,000 users from 36 countries, we show that incorporating both acoustic information of the tracks a user has listened to and the cultural background of users in the form of a music-cultural user model contributes to improved recommendation performance. Furthermore, we provide a systematic analysis of the influence of different features on the quality of the provided culture-aware track recommendations. We find that considering acoustic features that model the characteristics of tracks and a user’s musical preferences has the highest impact on recommendation performance. However, adding socio-economic features further improves the recommendation quality. In addition, we identify interesting correlations between acoustic characteristics of music preferences and cultural features of populations at the country level.


Introduction
Recent advances in recommender systems and music information retrieval have shown that contextual information is vital for highly personalized results (e.g., Wang et al. (2012a); Braunhofer et al. (2013); Pichl and Zangerle (2018)). In this scope, context can be defined as "conditions or circumstances which affect something" (Adomavicius and Tuzhilin, 2011), where, e.g., environment-related contextual information may include location, time, or weather. Consequently, the user's listening context can be defined as the user's context while listening to music. To this end, the geographic location of a user is often exploited as one basic notion of context. Leveraging GPS coordinates to model similarity between listeners, which is key to building recommender systems, results in location-aware systems, which are, however, agnostic to cultural characteristics and the cultural background of users. In the scope of this article, we define the cultural background of users as a set of attributes that allow for describing the culture the user is embedded in, including social or economic aspects, as well as, e.g., cultural practices, values, and behavior. However, location alone does not necessarily serve as a good indicator of the cultural background of a user, as geographically close users might have very different cultural backgrounds. A user's cultural background may also not coincide with political borders (Pichl et al., 2017). Notably, the cultural background of a user was already identified by Schedl and Schnitzer (2014) as a possibly relevant aspect to improve recommender systems. We hence argue that modeling users based on musical properties of the songs they listen to (approximating their musical preference) on the one hand and the user's cultural background on the other contributes to capturing music-cultural listening patterns.
These patterns particularly describe the complex interrelation between users, their cultural background, and the characteristics of the music they listen to. In this article, we propose a novel music-cultural user modeling approach to exploit such listening patterns for recommender systems by integrating information about (i) the acoustic qualities of the music users have listened to and (ii) culture-specific information derived from the users' location/country to describe the user's likely cultural background.
Leveraging a standardized collection of almost one billion user-generated listening events, we evaluate the proposed user model. By exploiting music-cultural listening patterns captured by the proposed user model in a recommender system, we show that the resulting culture-aware music recommendations are more accurate than those provided by a recommender agnostic to cultural information. Particularly, we find that capturing a user's individual music taste by the high-level audio features of the tracks the user has listened to and adding Hofstede's cultural dimensions (Hofstede et al., 1991) as well as data from the World Happiness Report (WHR) (Helliwell et al., 2016) as a description of the cultural (and socio-economic) background of the user provides the best recommendation results in terms of accuracy and error measures.
The remainder of the article is organized as follows. Section 2 briefly reviews related work on context- and culture-aware music recommendation. The dataset we use, a processed version of the LFM-1b dataset (Schedl, 2016), is presented in Section 3. Section 4 provides details on (i) our methods for user modeling according to musical preferences and cultural aspects, and (ii) our proposed culture-aware recommender system. The experiments we conducted to evaluate the user models and recommender system approaches are explained in Section 5. We present and discuss the results obtained in Section 6. To gain more insights into the overall and country-specific patterns of acoustic music preferences, Section 7 presents results of an additional study on differences in acoustic preferences between countries and on correlations between cultural and musical features. The paper is rounded off by a summary and outlook to follow-up research in Section 8.

Related Work
In music recommender systems, unlike for instance in movie recommendation, content-based approaches have been the dominant focus of research for a long time (Knees and Schedl, 2016). Music content is, in this case, either incorporated into the recommendation algorithm in the form of handcrafted acoustic features or, more recently, by automatic feature extraction from the raw audio signal using deep neural networks. Examples of the former include a rich set of features that have been proposed in the past two decades of music information retrieval research, and range from Mel frequency cepstral coefficients (MFCCs), e.g., Logan (2002), to semantic descriptors of acoustic properties, e.g., Miotto et al. (2010); Turnbull et al. (2008). For an overview, consider, for instance, Casey et al. (2008); Knees and Schedl (2013). Deep learning-based approaches to automatic feature learning for content-based music recommendation include convolutional neural networks (CNN) and recurrent neural networks (RNN), in particular their variants long short-term memory (LSTM) and gated recurrent units (GRU). For a more detailed review of deep learning approaches in music recommendation, please consider Schedl (2019).
Nowadays, it has become widely accepted that incorporating contextual information into recommender systems contributes to improved recommendations (Adomavicius and Tuzhilin, 2011). Particularly for music recommender systems, studies showed that users often seek music that matches their current situation, and hence context (i.e., occasion, event, or emotional state) (Kim and Belkin, 2002; Lee and Downie, 2004). In the scope of music recommender systems, Kaminskas and Ricci (2012) distinguish environment-related context (location, time, or weather), user-related context (activity, demographic information, or emotional state of the user), and multimedia context (text or pictures the user is currently reading or looking at). For our study, the environment-related context of a user is of particular relevance as we aim to leverage both the musical preferences and cultural background of users for improving track recommendations. Schedl and Schnitzer (2013) performed a study on the contribution of geospatial information to the performance of artist recommender systems. They conclude that if users listen to various different artists, the integration of geospatial information is beneficial. Approximating the cultural distance of users by the country or continent a user is located in has likewise been shown to be beneficial, particularly for users in the U.S. and Russia. Furthermore, there are several approaches that exploit places of interest as contextual information, where the idea is to recommend music that suits the environment in an emotional or cultural sense (Braunhofer et al., 2011). Rich sensory devices such as smartphones allow mapping a certain location to a certain activity that can be exploited for personalized location-based music recommendations, depending on the user's inferred activity (Wang et al., 2012b). Baltrunas et al. (2011a) propose a context-aware music recommender system for car drivers, where a set of diverse contextual factors are incorporated (e.g., driving style, traffic conditions, weather, or road type). Ankolekar and Sandholm (2011) propose the Foxtrot system, which allows users to tag music with geolocations. Based on this information, users can be provided with location-specific music recommendations. Cheng and Shen (2014) model the listener's short-term music needs, their location, and the music's overall popularity to create personalized music recommendations. Hu and Ogihara (2011) propose a music recommender system that integrates track genre, release year, freshness, and temporal aspects.
As for cultural aspects in the broader field of music information retrieval, Ferwerda and Schedl (2016) found that a user's cultural background (modeled by Hofstede's cultural dimensions (Hofstede et al., 1991)) influences how diverse the musical preferences of users are. Particularly, they found that highly individualist countries and countries that are flexible, pragmatic, and eager to adapt to changes listen to more diverse genres. Schedl et al. (2017) also performed a study on whether cultural similarity between countries (described by Hofstede's cultural dimensions and the Quality of Government (QoG) dataset) is reflected in music taste (described by tags annotating music tracks). They found medium correlations between music taste and several cultural and socio-economic factors. Notably, this evaluation is based on the LFM-1b dataset, which is also utilized in the experiments conducted in this study. Furthermore, Liu et al. (2017, 2018) have uncovered similarities between countries based on cultural and socio-economic aspects on the artist level and on the album level. Pichl et al. (2017) clustered users based on their individual musical preferences and their cultural characteristics. Relying on density-based spatial clustering, they find nine clusters that describe similar users regarding both their musical preference and cultural background. The cultural background of users was described by the World Happiness Report (Helliwell et al., 2016) and the authors found that incorporating cultural information allows for more precise user descriptions compared to relying on geographic information only. However, this evaluation did not target recommender systems and was done on a substantially smaller dataset.
We are not aware of any work exploiting the cultural background of users for the computation of context-aware music recommendations and hence identify a research gap here. In this paper, we show that utilizing the cultural background of users together with their general musical preference contributes to improved recommendation quality.

Data
In this section, we present the data utilized for performing our analyses and experiments.
For our analyses, we require a dataset that contains a substantial number of listening histories of users as well as country information about these users. There are indeed a number of datasets containing listening histories: the Million Musical Tweets Dataset (Hauger et al., 2013) and the MusicMicro dataset (Schedl, 2013) come with contextual information related to time and location. The musical listening histories dataset (Vigliensoni and Fujinaga, 2017), the Yahoo! Music ratings dataset (Dror et al., 2012), and the #nowplaying dataset (Zangerle et al., 2014) contain a substantial number of users and items, also including timestamps of listening events (LEs); however, no contextual information regarding the user's country is given. Hence, we base our investigations on the LFM-1b dataset (Schedl, 2016), which contains more than one billion listening events created by users of the online music platform Last.fm, where music listeners can share information about their listening behavior. The LFM-1b dataset has been created in the following way using various endpoints of the Last.fm API (Schedl, 2017): first, the top artists labeled by any of the 250 top user-generated tags used on Last.fm were retrieved. Then, the top fans of these artists were fetched, resulting in about 465,000 users. Listening histories (i.e., each user's set of listening events) of a randomly chosen subset of 120,322 users were subsequently downloaded. The creation times of the listening events cover the time span between January 2005 and August 2014.
Since we aim to model music-cultural preferences jointly by individual musical preference and the cultural background of users, we require the data to contain information about the location of the user. For 45.87% of all users within the LFM-1b dataset, country information about the user is available. Therefore, we constrain the dataset to those users (and their tracks) for whom we are able to obtain country information. This provides us with a dataset comprising 55,191 users, who have listened to a total of 26,022,625 distinct tracks, which are captured by a total of 807,890,921 listening events.
Besides the information contained in the LFM-1b dataset, we also require information about the tracks the users listened to (cf. Section 4.1). Particularly, we are interested in content features that are able to describe a given track. Therefore, we rely on the Spotify API to gather content-based audio features, as described in Section 4.1, for each track. For all listening events of users for whom we can obtain country information, we search for the <track, artist, album> triples extracted from the LFM-1b dataset using the Spotify search API to gather the Spotify URI of each track (i.e., we provide all three parts in a conjunctive query). This search allowed us to gather 4,326,809 Spotify URIs. Each URI is subsequently used to query the audio features API, which returns the set of audio features describing the contents of a given track (cf. Section 4.1). For the remainder of the tracks, the Spotify API is not able to correctly resolve the triples to a track. We attribute this to two factors: either the searched track is not provided by Spotify or the track, artist, and album information cannot be matched to a Spotify track URI unambiguously. Also, the Spotify API does not provide all features for all tracks and hence, we remove those tracks for which the API does not provide a full set of audio features from the dataset. Employing this procedure, we are able to acquire the full set of audio features for a total of 3,478,399 tracks. Notably, these 13.36% of the distinct tracks for which we can obtain audio features capture 48.89% of all listening events (i.e., the tracks listened to by users).
The remaining tracks and respective listening events are excluded from the dataset. This eventually results in a dataset of 55,149 users, 394,944,868 listening events and 3,478,399 distinct tracks. Table 1 depicts the main characteristics of the dataset underlying our analyses. As can be seen, the average number of listening events per user is 7,161, which we consider a substantial number that is able to capture a user's individual musical preferences well. Furthermore, the average number of users per country is 1,156. Along the lines of Ferwerda and Schedl (2016), we constrain the dataset to countries with more than 200 users to ensure that countries are well-characterized and results are valid and representative (at least of a typical music streaming community such as the one at Last.fm).
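As a quick sanity check, the shares and the per-user average quoted above can be recomputed from the absolute counts given in this section. The snippet below is only an arithmetic sketch; the variable names are ours and not part of the original pipeline.

```python
# Counts as reported in this section (before and after requiring audio features).
distinct_tracks_before = 26_022_625
listening_events_before = 807_890_921
distinct_tracks_after = 3_478_399
listening_events_after = 394_944_868
users_after = 55_149

# Share of distinct tracks and of listening events that survive the filtering.
track_share = distinct_tracks_after / distinct_tracks_before
event_share = listening_events_after / listening_events_before

# Average number of listening events per remaining user.
events_per_user = listening_events_after / users_after

print(f"distinct tracks kept:  {track_share:.4%}")
print(f"listening events kept: {event_share:.4%}")
print(f"listening events/user: {events_per_user:.1f}")
```

The point of the comparison is that a small fraction of the catalog accounts for roughly half of all listening events, which is why the filtering step is less costly than the track share alone suggests.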

Methods
In the following, we detail the proposed approach for leveraging individual and cultural listening patterns for the computation of track recommendations based on the underlying dataset (as described in Section 3). We first present our user modeling approach (for individual and cultural listening patterns) and secondly present the proposed music-cultural user model. Subsequently, we show how we leverage this model for the computation of track recommendations.

User Modeling: Musical Preferences
As for modeling individual musical preferences, we gather content-based audio features for each of the tracks in the dataset by querying the Spotify API, following the lines of, e.g., Pichl et al. (2016); Andersen (2014); McVicar et al. (2011). We make use of these Spotify high-level features for a number of reasons: first, the LFM-1b dataset does not contain audio data that we could use to extract audio features from. Second, our analyses aim at investigating the general suitability of merging acoustic and cultural cues for music recommendation rather than low-level feature engineering and hence, we rely on Spotify's audio features as a compact characterization of tracks. These content features are extracted from the audio signal of a track and comprise:
1. Danceability describes how suitable a track is for dancing and is based "on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity."
2. Energy measures the perceived intensity and activity of a track. This feature is based on the dynamic range, perceived loudness, timbre, onset rate, and general entropy of a track.
3. Speechiness detects the presence of spoken words in a track. High speechiness values indicate a high degree of spoken words (talk shows, audio books, etc.), whereas medium to high values indicate, e.g., rap music.
4. Acousticness measures the probability that the given track contains only acoustic instruments.
5. Instrumentalness measures the probability that a track contains no vocals (i.e., it is instrumental).
6. Tempo quantifies the rate of the beat in beats per minute.
7. Valence measures the "emotional positiveness" conveyed by a track (i.e., cheerful and euphoric tracks reach high valence values).
8. Liveness captures the probability that the track was performed live (i.e., whether an audience is present in the recording).

User Modeling: Cultural Aspects
As for the cultural dimension, we propose to model cultural aspects on a country level and make use of two different resources: Hofstede's cultural dimensions (Hofstede, 1980; Hofstede et al., 1991) and the World Happiness Report of 2016 (Helliwell et al., 2016), which we describe in the following. A widely accepted instrument to describe cultures is Hofstede's cultural dimensions (HOF). This framework describes a nation's culture and values by the following six dimensions:
1. Power distance (PD) is defined as "the extent to which the less powerful members of organizations and institutions (like the family) accept and expect that power is distributed unequally" (Helliwell et al., 2016).
2. Individualism (IDV) captures the extent to which people are integrated into groups. Societies with high scores possess only loose ties and the individual is considered more important than the collective group.
3. Masculinity (MAS) assesses a preference in society for achievement, heroism, assertiveness, and material rewards for success. Low masculinity (femininity) signals a preference for cooperation, modesty, caring for the weak, and quality of life.
4. Uncertainty avoidance (UA) measures to which degree members of a society tolerate ambiguity. Countries with a high score tend to rely on stiff codes, guidelines, and laws. In contrast, lower-scoring countries show more tolerance and acceptance of differing thoughts.
5. Long-term orientation (LTO) measures the connection of the past with current and future actions or challenges. Low-scoring societies tend to keep traditions and norms and are suspicious of societal change, while high-scoring societies encourage thrift and adaptation.
6. Indulgence (IND) captures the happiness of a country and "relatively free gratification of basic and natural human drives related to enjoying life and having fun". In countries with low indulgence scores, gratification of needs is suppressed and regulated by strict social norms.
In addition to Hofstede's cultural dimensions, we complement our model with socio-economic characteristics of countries. We capture these by figures extracted from the World Happiness Report (WHR) (Helliwell et al., 2016). Schimmack et al. (2002) showed that cultural factors are directly influenced by the subjective well-being of people. Therefore, we rely on the WHR as it captures people's cognitive and affective evaluations of their daily life and thus, their subjective well-being (Diener, 2000) on a country level. The WHR provides the following set of measures capturing the perceived happiness of countries:
1. Freedom measures the perceived freedom to make life choices.
2. Healthy life expectancy captures the healthy life expectancy at birth in a given country.
3. Generosity specifies whether people in a country are willing to spend money on a charity.
4. Social support states whether people have someone to help them when they need support (i.e., relatives or friends).
5. Trust measures the publicly perceived absence of corruption in government and business.
6. Happiness quantifies the subjective and perceived happiness.
7. GDP is the real gross domestic product per capita.

Music-Cultural User Model
Based on the features we leverage to capture a user's musical preferences (Section 4.1) and a user's cultural background (Section 4.2), we propose the following music-cultural user model for computing culture-aware recommendations.
Generally, we characterize a user's individual musical preferences and cultural background in a single feature vector. As for capturing a user's individual musical preferences based on the tracks listened to, we leverage the audio features of tracks as presented in Section 4.1. Except for tempo, all of these features are given in the range of [0, 1]. For tempo, we apply a linear min-max scaling to also represent it in the range of [0, 1]. To exclude tracks with audio features that distort a user's aggregated musical features, we remove outlier tracks from the user's listening history by applying the median absolute deviation (MAD) outlier detection method (Leys et al., 2013). We consider a feature value an outlier if it is not within M ± a · MAD, where M is the median of this particular feature across all tracks of a user and MAD is the median absolute deviation of these values. As for the choice of a, we set a strongly conservative threshold a = 3 as proposed by Leys et al. (2013). Hence, a value is considered an outlier if it is not within three MADs around the median. Lastly, a track is considered an outlier in the list of tracks of a particular user if at least one of its feature values is an outlier; such tracks are removed from the user's listening history. For each of the features, we then compute the average feature value and the standard deviation (SD) across all remaining tracks in the user's listening history and add these average and SD values to the user's feature vector. We chose to add the standard deviation of each of these features to mitigate the effects of averaging a large number of feature values that potentially differ substantially.
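The outlier filtering and aggregation described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: tracks are plain dictionaries of already scaled feature values, the raw MAD is used without the consistency constant discussed by Leys et al. (2013), the population standard deviation is used, and the function names `mad_bounds` and `user_audio_profile` are hypothetical.

```python
from statistics import mean, median, pstdev


def mad_bounds(values, a=3.0):
    """Return the interval (M - a * MAD, M + a * MAD) for one feature."""
    m = median(values)
    mad = median([abs(v - m) for v in values])
    return m - a * mad, m + a * mad


def user_audio_profile(tracks, a=3.0):
    """Drop tracks with at least one outlying feature, then aggregate.

    tracks: list of dicts mapping feature name -> value in [0, 1].
    Returns (profile, kept) where profile holds mean and SD per feature.
    """
    features = sorted(tracks[0])
    bounds = {f: mad_bounds([t[f] for t in tracks], a) for f in features}
    # A track is kept only if every one of its features lies inside the bounds.
    kept = [
        t for t in tracks
        if all(bounds[f][0] <= t[f] <= bounds[f][1] for f in features)
    ]
    profile = {}
    for f in features:
        values = [t[f] for t in kept]
        profile[f + "_mean"] = mean(values)
        profile[f + "_sd"] = pstdev(values)  # assumption: population SD
    return profile, kept


# Toy listening history; the last track has an unusually low energy value.
tracks = [
    {"energy": 0.60, "valence": 0.50},
    {"energy": 0.62, "valence": 0.55},
    {"energy": 0.58, "valence": 0.45},
    {"energy": 0.61, "valence": 0.52},
    {"energy": 0.05, "valence": 0.50},
]
profile, kept = user_audio_profile(tracks)
print(len(kept), profile)
```

In the toy history, the low-energy track is removed as a whole even though its valence is unremarkable, which mirrors the rule that a single outlying feature disqualifies the track.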
For the approximation of the cultural background of users (or rather, the country they live in) by socio-economic aspects, we rely on the variables of Hofstede's cultural dimensions and the World Happiness Report and extract these based on the user's country information. We add these variables to the feature vector to find cultural listening patterns that reflect cultural similarity better than geographic distance does. For each of these variables, we perform a linear min-max scaling such that all elements of the resulting vector are within [0, 1] and concatenate this vector with the user vector.
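A minimal sketch of the scaling and concatenation step is given below. It assumes that each variable is scaled across the countries in the dataset, which the text does not state explicitly; the country codes and values are placeholders rather than actual Hofstede or WHR figures, and all function names are hypothetical.

```python
def min_max_scale(column):
    """Linearly map a list of values to [0, 1]; constant columns map to 0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def scale_country_table(table):
    """Scale every variable across countries.

    table: dict country -> dict variable -> raw value.
    Returns the same structure with values in [0, 1].
    """
    countries = sorted(table)
    variables = sorted(next(iter(table.values())))
    scaled = {c: {} for c in countries}
    for var in variables:
        column = min_max_scale([table[c][var] for c in countries])
        for country, value in zip(countries, column):
            scaled[country][var] = value
    return scaled


def music_cultural_vector(audio_profile, country, scaled_table):
    """Concatenate a user's audio profile with the country's cultural variables."""
    cultural = scaled_table[country]
    return ([audio_profile[k] for k in sorted(audio_profile)]
            + [cultural[k] for k in sorted(cultural)])


# Placeholder values only, not real country scores.
raw = {
    "A": {"idv": 20.0, "happiness": 5.0},
    "B": {"idv": 60.0, "happiness": 6.0},
    "C": {"idv": 100.0, "happiness": 7.0},
}
scaled = scale_country_table(raw)
vector = music_cultural_vector({"energy_mean": 0.6, "energy_sd": 0.1}, "B", scaled)
print(vector)
```

Sorting the keys fixes the position of every feature in the vector, so all users share the same layout regardless of dictionary insertion order.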

Recommendation Computation
We model the computation of context-aware music recommendations based on the proposed user model as a learning task for rating prediction, where we aim to learn the probability P that a given user u has listened to a given track t. To learn these probabilities P(u, t) for all users and tracks, we rely on Gradient Boosting Decision Trees.
Particularly, we utilize the popular XGBoost system (Chen and Guestrin, 2016), a scalable end-to-end tree boosting approach which has been shown to be highly suited for recommendation tasks (Pacuk et al., 2016; Ayaki et al., 2017; Tran, 2016). Using XGBoost, we set the learning objective to logistic regression for binary classification, which provides us with the desired probabilities. For the training phase, we set the evaluation metric to be the binary classification error rate (i.e., the number of wrongly classified tracks in relation to all tracks classified, where tracks with a prediction value larger than 0.5 are classified as relevant for the given user, and all other tracks are considered irrelevant for the user).
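For illustration, the two settings named above might map onto XGBoost's documented parameters `objective="binary:logistic"` and `eval_metric="error"`. This mapping is our assumption; the full configuration is not given in the text, so no other parameters are shown. The helper `classification_error` is a hypothetical pure-Python restatement of the error rate with the 0.5 threshold, so that the sketch runs without XGBoost installed.

```python
# Assumed mapping of the described setup onto XGBoost parameter names.
xgb_params = {
    "objective": "binary:logistic",  # predictions are probabilities P(u, t)
    "eval_metric": "error",          # binary classification error rate
}


def classification_error(probabilities, labels, threshold=0.5):
    """Share of wrongly classified tracks.

    A track is classified as relevant if its predicted probability is
    strictly larger than the threshold, and as irrelevant otherwise.
    """
    wrong = sum(
        (p > threshold) != bool(y) for p, y in zip(probabilities, labels)
    )
    return wrong / len(labels)


# Two of the four toy predictions fall on the wrong side of the threshold.
error = classification_error([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
print(error)
```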
Please note that we deliberately chose a classification-based recommendation approach and refrained from utilizing more elaborate recommender approaches such as context-aware matrix factorization (Baltrunas et al., 2011b) or tensor-based factorization approaches (Karatzoglou et al., 2010) as we aim to focus on user modeling aspects in this paper. Hence, we chose to compare different user models based on a simple classification-based recommendation approach which also allows us to get a deeper understanding of the contribution of individual features of the user model (cf. Section 6).
For the classification task carried out, we require a rating for each track that allows us to define whether a given track was listened to (and thus, considered relevant) by a given user. Hence, we add a binary factor (rating) to the processed dataset: for each unique <user, track> combination, the rating $r_{i,j}$ is 1 if the user $u_i$ has listened to track $t_j$ at least once. Please note that users and tracks may be represented by different models as described in Section 5.1.2. Due to a lack of publicly available data, our dataset does not contain any implicit feedback of users (i.e., skipping behavior, session durations, or dwell times during browsing the catalog). This is why we cannot estimate any preference towards an item a user has not listened to as proposed by Hu et al. (2008). Thus, we treat tracks the user has not listened to (in the case of implicit data, all non-observed tracks) as negative examples (Hu et al., 2008). Even though there is a certain bias towards negative values as some missing values might be positive, Pan et al. (2008) found that this method for rating estimation works well. The rating $r_{i,j}$ for a given user $u_i$ and given track $t_j$ can now be defined as stated in Equation 1:

$$r_{i,j} = \begin{cases} 1 & \text{if } u_i \text{ has listened to } t_j \text{ at least once} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
We train an XGBoost model that performs a binary classification on the relevance of tracks for the given users. We extract the probabilities underlying the classification decision, which can be used to (i) perform a ranking of tracks by their probability of relevance in the recommendation task, which allows us to conduct a ranking-based evaluation of the proposed models, and (ii) evaluate the predictive performance of the proposed models by computing error metrics.

Experiment Design
This section reports on the experiments conducted for evaluating the previously described culture-aware recommender system.

Experimental Setup
In the following, we first present the user models evaluated and describe the evaluation method utilized for capturing the recommendation performance of the proposed user model.

Evaluation Strategy
To evaluate the performance of the proposed contextual user modeling in regard to recommendation quality, we perform a per-user evaluation. Therefore, we use each user's listening history and perform a leave-k-out evaluation per user (also referred to as holdout evaluation) (Breese et al., 1998; Cremonesi et al., 2008), where we set k to 50 (as described later in this section).
The underlying dataset only provides items with positive feedback (Hu et al., 2008) (i.e., items that have been listened to by the user) gathered via users' listening histories. As the recommendation task is transformed into a rating prediction task, we require the dataset to also include negative examples. Therefore (and as described previously in Section 4.4), for each user, we randomly add tracks the user did not interact with (i.e., tracks $t_j$ with $r_{i,j} = 0$ for the given user $u_i$) to the dataset until the listening history of each user in both the training and test sets is filled with 50% relevant and 50% non-relevant items for the user. We chose to oversample the positive class to avoid class imbalance and hence, a bias towards the negative class (the number of tracks not listened to is much larger than the number of tracks listened to, for all users).
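The balancing step can be sketched as drawing, for each user, as many random non-listened tracks as the user has listened tracks. The function name `balance_with_negatives`, the fixed seed, and the toy catalog are our own illustrative choices; the actual sampling procedure may differ in detail.

```python
import random


def balance_with_negatives(listened, catalog, rng):
    """Return (track, rating) pairs with equally many positives and negatives.

    listened: set of tracks the user has listened to (rating 1).
    catalog:  iterable of all tracks in the dataset.
    rng:      random.Random instance used for sampling negatives.
    """
    # Candidate negatives are all catalog tracks the user never listened to.
    candidates = sorted(set(catalog) - set(listened))
    n_negatives = min(len(listened), len(candidates))
    negatives = rng.sample(candidates, n_negatives)
    positives = sorted(listened)
    return [(t, 1) for t in positives] + [(t, 0) for t in negatives]


listened = {"t1", "t2"}
catalog = ["t1", "t2", "t3", "t4", "t5"]
samples = balance_with_negatives(listened, catalog, random.Random(0))
print(samples)
```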
As we aim to evaluate the benefit of adding cultural aspects in a track recommendation scenario, we also need to characterize tracks. For our proposed model, we rely on the acoustic features of each track and add these to the track vector. However, we also need to assign cultural features to tracks to be able to match users of a certain culture with tracks that are listened to by users with a similar cultural background. This is particularly relevant for tracks in the negative class. Preliminary experiments showed that we cannot assign randomly computed cultural features or the cultural features of the current user to tracks as this causes the XGBoost model to learn that all tracks with the user's culture assigned belong to the positive class, whereas all tracks from any other culture (i.e., culture information that is consistent across a number of users or purely random culture information) belong to the negative class. Therefore, we propose to assign the cultural features of the country in which the track is most popular to each track. We argue that the track is most characteristic and representative for the country in which the track is most popular. Therefore, we first compute the playcounts of each track in each country within the dataset. Next, we normalize the playcount (PC) of each track t ∈ T (i.e., the universe of tracks in the dataset) in each country c by the total amount of listening events of the country (i.e., we compute $PC(t, c) / \sum_{t' \in T} PC(t', c)$ for each country c and for each track t). This allows us to infer the country in which the track accounts for the highest share of listening events and hence, is most popular. We subsequently assign the culture of this country to the track. For obtaining negative samples (tracks), we randomly select a track from the dataset that the current user has not listened to and again assign this track the cultural features of the country where the track is most popular.
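The country assignment can be illustrated with a short sketch: playcounts are normalized by each country's total number of listening events, and every track is mapped to the country in which its share is highest. The function name `most_popular_country`, the input layout, and the toy counts are assumptions made for this example.

```python
from collections import defaultdict


def most_popular_country(playcounts):
    """Map each track to the country where its share of listening events peaks.

    playcounts: dict (track, country) -> number of listening events.
    """
    # Total number of listening events per country (the normalizer).
    country_totals = defaultdict(int)
    for (_track, country), count in playcounts.items():
        country_totals[country] += count

    best = {}  # track -> (highest share seen so far, country)
    for (track, country), count in playcounts.items():
        share = count / country_totals[country]
        if track not in best or share > best[track][0]:
            best[track] = (share, country)
    return {track: country for track, (_share, country) in best.items()}


# Track "a" has more plays in US in absolute terms (50 vs. 20), but a larger
# share of all listening events in AT (20/100 vs. 50/1000).
playcounts = {
    ("a", "US"): 50,
    ("b", "US"): 950,
    ("a", "AT"): 20,
    ("b", "AT"): 80,
}
assignment = most_popular_country(playcounts)
print(assignment)
```

The toy data show why normalization matters: without it, large countries would dominate the assignment simply because they generate more listening events.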
Based on the dataset that now contains an equal amount of positive and negative samples for each user, we use a leave-k-out evaluation strategy. Therefore, we have to compute a holdout set of size k for each user: along the lines of previous research (He et al., 2017; Elkahky et al., 2015), we randomly select 50 positive samples (tracks that the user has listened to) and 500 negative samples (tracks the user has not listened to). These 550 tracks form the test set for each user, whereas the recommender system is trained on the remainder of the dataset. Subsequently, we compute the predicted ratings for the tracks in the test set as presented in Section 4.4, aiming to rank the 50 positive samples on top, whereas the negative samples should be ranked on the bottom of the ranked list of recommendations.
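A per-user holdout split along these lines might look as follows. The sketch assumes that positives and negatives are available as separate lists per user; `leave_k_out` and `rank_by_score` are hypothetical helper names, and the scoring function in the usage example is an oracle used only to show the intended ranking, not a trained model.

```python
import random


def leave_k_out(positives, negatives, rng, n_pos=50, n_neg=500):
    """Split one user's samples into training and test (holdout) sets."""
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    test = [(t, 1) for t in pos[:n_pos]] + [(t, 0) for t in neg[:n_neg]]
    train = [(t, 1) for t in pos[n_pos:]] + [(t, 0) for t in neg[n_neg:]]
    return train, test


def rank_by_score(test, score):
    """Order held-out tracks by predicted probability, highest first."""
    return sorted(test, key=lambda pair: score(pair[0]), reverse=True)


# Toy user with 80 listened and 600 non-listened tracks.
positives = [f"p{i}" for i in range(80)]
negatives = [f"n{i}" for i in range(600)]
train, test = leave_k_out(positives, negatives, random.Random(42))

# Oracle scorer for illustration: a perfect model ranks all positives first.
ranked = rank_by_score(test, lambda t: 1.0 if t.startswith("p") else 0.0)
print(len(train), len(test), sum(r for _, r in ranked[:50]))
```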

Evaluated Models and Baselines
To assess the performance of each of the proposed user models, variations thereof, and two baseline approaches in terms of recommendation quality, we separately evaluate these different user models and compare their performance. An overview of the evaluated modeling approaches is depicted in Table 3. The evaluated models describe a user either by the user's individual music preferences described by the acoustic features of the tracks the user listened to (U_AF), the user's cultural/socio-economic background described by Hofstede's dimensions (U_HOF) and the World Happiness Report (U_WHR), or the user identifier (U_ID). Similarly, we describe tracks by their acoustic features (T_AF), the culture they are embedded in (T_HOF and T_WHR), or by their track identifier (T_ID). Please note that we include the user and track identifiers in the respective models as this allows us to extend and directly compare the approaches to a baseline model (User + Track) that is only based on these two identifiers. As can be seen from Table 3, we evaluate the music-cultural model (Music + Culture) as proposed in Section 4.3. We also individually evaluate the performance of a model solely relying on musical preferences of users and features of tracks (Music model), and analogously a model that describes users and tracks by their cultural background (Culture model).
Furthermore, we investigate a set of baselines to compare our proposed models to. First, we evaluate an approach that uses each user's listening history and additionally utilizes the user's country code (e.g., US for users from the United States) as contextual information for both the user and the track (Country model). Here, we aim to evaluate whether the country code may act as a proxy for cultural factors of users. Second, we evaluate a context-agnostic baseline relying solely on the users' listening histories, i.e., a model that solely relies on the user and track ids for classification (User + Track) in a traditional collaborative filtering approach.

Evaluation Metrics
We model the context-aware recommendation of tracks as a rating prediction task; therefore, we use the root mean squared error (RMSE) and mean absolute error (MAE) to measure the prediction error. We compute the RMSE and MAE for each individual user and subsequently compute the average among all users. Furthermore, we are also interested in a decision-based evaluation (Celma, 2010) of our approach and therefore compute precision, recall, and the F1 measure to assess the top-n accuracy (Cremonesi et al., 2010), where n is the number of top-ranked track recommendations that is evaluated. This requires the set of computed recommendations to be ranked. Hence, we rank the track recommendation candidates with respect to the probability that they belong to the positive class in descending order and compute the top-n track recommendations. Next, we have to transform the rating prediction task into a binary classification task (Pan et al., 2008) for deciding whether a given track is relevant or not for a given user. For our experiments, we consider all predicted probabilities P(u, i) > 0.5 as a predicted interaction and thus consider these items as relevant and all others as irrelevant. For assessing the overall precision, recall, and F1 measure of the evaluated recommender systems, we compute the measures for each individual user and compute the average among all users. For computing the recall measure, all relevant items in the test set are considered, independent of the number of recommendations. Thus, there is a natural cap for recall, namely the number of recommendations divided by the number of relevant items in the test set.
Regarding the number n of evaluated recommendations, we argue that exposing a user to more than 10-20 tracks at a time might provoke choice overload and hence is barely meaningful. The problem of choice overload has been addressed by Bollen et al. (2010), who state that user satisfaction is highest when presenting the user with the top-5 to top-20 items, assuming that the recommendation list contains a sufficient number of relevant items for the user. Hence, we are particularly interested in the performance of the proposed recommendation approaches for lower values of n. Furthermore, we argue that in the presented scenario, precision is the more important measure to consider from a user perspective, as it is able to better capture the user's effective utility of the provided recommendations (Bellogin et al., 2011) and hence the practical value of the recommender system for the user. Thus, we argue that particularly the precision@10 results are relevant for our evaluation. As for the tuning of XGBoost parameters, we performed a preliminary cross-evaluation aiming to optimize precision values for the proposed models and, based on this, set the maximum number of trees to learn the models to 1,000. For all other parameters, we rely on the default settings.
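The following Python sketch shows one possible reading of the decision-based evaluation for a single user: candidates are ranked by predicted probability, the top n are kept, and only those with a probability above 0.5 count as recommended. The function name, the input format, and the exact way in which ranking and thresholding are combined are assumptions for illustration and may differ from our implementation.

```python
def top_n_metrics(scored, relevant, n=10, threshold=0.5):
    """Precision, recall, and F1 for the top-n recommendations of one user.

    scored:   dict mapping track id -> predicted probability of the positive class
    relevant: set of track ids in the test set that the user has listened to
    """
    ranked = sorted(scored, key=scored.get, reverse=True)[:n]
    recommended = [t for t in ranked if scored[t] > threshold]
    hits = sum(1 for t in recommended if t in relevant)
    precision = hits / len(recommended) if recommended else 0.0
    # Recall considers all relevant test items, independent of n.
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the top-3 list is a, b, c; a and c are relevant, b is not.
scored = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.4, "e": 0.2}
relevant = {"a", "c", "d", "e"}
precision, recall, f1 = top_n_metrics(scored, relevant, n=3)
```

With three recommendations and four relevant items, recall in this toy example cannot exceed 3/4, which illustrates the natural cap on recall. The per-user values would then be averaged over all users.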

Experimental Results and Discussion
In the following, we first present the findings of the top-n recommendation evaluation task (Section 6.1), before presenting the evaluation of the underlying rating prediction task in Section 6.2. Subsequently, we elaborate on the importance of individual features of the proposed user model (Section 6.3) and discuss the limitations of the approach (Section 6.4). Table 4 shows the results obtained by the evaluated user models (cf. Table 3), where we consider the top-10 ranked recommended tracks for the evaluation. Regarding the precision of the computed recommendations, we observe that the best results are obtained by the proposed Music + Culture model, which incorporates both the user's general musical preferences and the cultural background of the user. This model reaches a precision@10 of 0.98, whereas the Music model reaches a precision of 0.95 and the Culture model a precision of 0.31. Compared to the baselines, we observe that using only the country of the user as a proxy for cultural aspects (Country model) achieves a precision value of 0.83, whereas the User + Track model performs worse, reaching a precision value of 0.13. Regarding the recall values obtained, we observe that again, the Music + Culture model performs best (0.63), followed by the Music (0.59) and Country (0.52) models, whereas the User + Track baseline again reaches a substantially lower value (0.08). For the sake of completeness, we also list the F1 values obtained by the individual models, which are consistent with the individual findings regarding recall and precision. In preliminary baseline experiments, we have also compared our approach with a traditional context-agnostic matrix factorization approach. Singular value decomposition based on implicit feedback achieved a precision of 0.49, a recall of 0.10, and an F1 score of 0.17. As already elaborated, we consider the precision metric more relevant in this scenario.
Thus, these baseline results show that the proposed models do indeed contribute to recommendation quality. Figure 1 shows a precision/recall plot of the evaluated approaches for n = 1…50 track recommendations. From this plot, we again observe the superior performance of the music-cultural user model across all evaluated lengths of recommendation lists n. The plot also highlights the difference between the two models that incorporate acoustic features for describing musical preferences (Music + Culture and Music) and the remaining user models that do not exploit this information, for which precision and recall are both substantially lower. These findings underline that the musical preference of users is paramount for recommendation scenarios. We can also observe that using the user's country as a proxy for their cultural background does indeed contribute. Naturally, including a set of cultural features to describe the user's cultural background also provides a more comprehensive, multidimensional notion of similarity between users (Schedl and Schnitzer, 2013), which can be exploited by the recommender system. We have also experimented with combining musical features and country code; however, this did not increase performance compared to using only musical features.

Rating Prediction Evaluation
Besides the decision-based evaluation regarding recall and precision, we are also interested in the prediction accuracy of the individual user models.

Influence of Features
Apart from the performance of the proposed music-cultural user model in regard to recommendation quality, we are also interested in the contribution of the individual features of the user model to the trained XGBoost classification model. Therefore, we utilize the gain of each feature in the XGBoost model (Chen and Guestrin, 2016), which is a measure for the improvement in accuracy when adding a split on the given feature to the tree. This gain is computed for each feature in every tree of the trained model and is then averaged to a final gain value for each feature. Figure 2 shows the contribution of the top-30 individual features to classification performance of the proposed music-cultural user model. Please recall that in the proposed model, both users and tracks are described by musical and cultural features (cf. Table 3). Hence, we color the bars of user features in blue and track features in red. In total, acoustic features account for 93% of the gain (76% user features, 17% track features), WHR features account for 4%, and Hofstede's dimensions for 3% of the gain.
The results show that the major contributing features are related to the acoustic features that describe the user's musical preference and the tracks. This high importance of acoustic features when it comes to describing users is congruent with the analyses of Pichl et al. (2017) and in line with the findings of the top-n recommendation evaluation, where the Music model was the second best performing model. The features that contribute most to the classification accuracy (and hence, recommendation performance) are the average acousticness (user_acousticness_avg), instrumentalness (user_instrumentalness_avg), and danceability (user_danceability_avg) of tracks the user has listened to. As for the track features, acousticness and instrumentalness are also the main contributing features. This high contribution of instrumentalness and acousticness is in line with previous findings (Pichl et al., 2016), where these two features have been shown to discriminate tracks well in a principal component analysis. These findings are also congruent with the results of the evaluation conducted, where the user model that solely relies on the user's preferences achieved the second best recall and precision values (performing substantially better than the Culture, Country, and User + Track models). However, while socio-economic factors are not among the top contributing features, they nevertheless contribute to the recommendation quality and make a decisive difference regarding recommendation performance. The user features contributing most are healthiness, social support, happiness, GDP, and masculinity; for tracks, the happiness and social support features provide the highest gain.
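To illustrate how per-feature gain values can be aggregated into the shares per feature family reported above, consider the following Python sketch. It assumes that a dictionary of per-feature gains is already available (in the XGBoost Python package, such a dictionary can be obtained via `Booster.get_score(importance_type="gain")`). The feature naming scheme and the gain values are hypothetical and do not correspond to the values in Figure 2.

```python
def gain_share_by_group(gains, group_of):
    """Aggregate per-feature gain into normalized shares per feature group.

    gains:    dict mapping feature name -> gain value
    group_of: function mapping a feature name to its group key
    """
    total = sum(gains.values())
    shares = {}
    for feature, gain in gains.items():
        group = group_of(feature)
        shares[group] = shares.get(group, 0.0) + gain / total
    return shares


def group_of(feature):
    # Hypothetical naming scheme: "<user|track>_<source>_<feature name>",
    # where source is af (acoustic), whr, or hof.
    side, source = feature.split("_")[:2]
    return (side, source)


# Made-up gain values for illustration only.
gains = {
    "user_af_acousticness_avg": 6.0,
    "user_af_danceability_avg": 2.0,
    "track_af_acousticness": 1.0,
    "user_whr_happiness": 0.5,
    "user_hof_masculinity": 0.5,
}
shares = gain_share_by_group(gains, group_of)
```

Here, `shares[("user", "af")]` holds the fraction of the total gain attributed to acoustic user features; the shares over all groups sum to one.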
While WHR features contribute more in our scenario, features stemming from both sources (WHR and Hofstede's cultural dimensions) are among the top contributing features; this also supports our choice to include both social and economic features in the user model as both contribute to higher recommendation performance.

Discussion and Limitations
We believe that the proposed music-cultural user model and the conducted evaluation are an important first step towards culture-aware music recommender systems. The obtained results show that the proposed music-cultural user model outperforms all other evaluated models. However, we still see a few limitations of our approach, which we will elaborate on in the following. First, we currently represent the musical preferences of a user by utilizing the average of the acoustic features of the tracks the user has listened to and the standard deviation thereof. While we believe that this method is sufficiently elaborate for the experiments conducted, this is a rather naive approach towards representation and does not reflect the diverse and often context-related musical preferences of users. Similarly, we currently use a rather simple majority voting approach for assigning cultural features to tracks. However, in the paper at hand, we are particularly interested in the influence of individual features and characteristics of users, their cultural background, and tracks on the recommendation performance and, hence, deliberately refrain from utilizing a more comprehensive user model. Nevertheless, looking into creating more comprehensive and complex user models based on the cultural background of users is part of our future research agenda. For instance, Zangerle and Pichl (2018) employed Gaussian Mixture Models (GMM) for modeling a user's diverse tastes of music and showed that utilizing such a GMM approach in combination with the acoustic features of the tracks the user listened to is able to capture a user's musical preferences well. The test set creation procedure applied (random 50 positive and 500 negative samples per user) allows for evaluating the ability to distinguish positive and negative samples.
We have also experimented with sampling 10 relevant and 100 irrelevant tracks for each user; however, we argue that given the high number of listening events per user in the dataset, sampling 50 positive and 500 negative tracks reflects a more suitable scenario. The results achieved were high in precision and low on the prediction error metrics, showing that the proposed models were able to detect the 50 positive samples and rank these on top.
As already stated in Section 4.4, we consider the classification-based approach for the computation of recommendations as a baseline regarding the actual recommender system. However, we believe that even though the method is rather simple, it provides us with conclusive results regarding the user models evaluated, which was our focus.

Interplay Between Country Characteristics and Music Preferences
In the following, we analyze the cultural/socio-economic and acoustic features on a country level more thoroughly, aiming to uncover country-specific patterns of their inhabitants' music preferences in terms of acoustic features and to identify similarities and differences between countries (Section 7.1). We further investigate to which extent cultural/socio-economic and acoustic features correlate with each other, on a per-feature basis (Section 7.2).

Country-specific Differences of Acoustic Feature Preferences
To obtain insights into country-specific particularities of the acoustic properties of music consumption, we provide an overview of the investigated acoustic features (and their standard deviations) per country, computed over all users in each country, in Table 6. Overall, we observe pronounced differences between countries for most of the properties, but also non-negligible standard deviations within countries, indicating partly substantial variances in music preferences among citizens. The highest danceability in music preferences can be found in France (0.533), Colombia (0.532), and Mexico (0.529); the lowest in Iran (0.455). Notably, Iran is also the country with the lowest music energy (0.599) in its population's preferences. In contrast, the populations of Finland (0.806), Bulgaria (0.801), and Hungary (0.800) like highly energetic music. This is further evidenced when investigating their preferred music styles, which include several variants of the genre metal. As for speechiness, the lowest figures are found in Indonesia and Argentina (both 0.048), whereas music listeners in Poland (0.065) tend to listen more commonly to music featuring spoken words such as hip hop or rap. Acousticness is lowest for Finland (0.062) and Bulgaria (0.063), and by far highest for Iran (0.278), China (0.232), and Turkey (0.199). As for instrumentalness, by far the lowest-scoring countries are Brazil (0.029), Indonesia (0.040), and Argentina (0.059). At the other end, users in Romania (0.224) and Greece (0.198) particularly like non-vocal instrumental music. Regarding liveness, Iran (0.133) and Turkey (0.137) show the lowest values, whereas Finland (0.166) has the highest figures for this attribute. This may be explained by Finns having a particular preference for live music and by Finland having a very vivid music performing culture and therefore a large number of hobby musicians as well as (semi-)professional bands.
Music listened to by Iranian users scores by far the lowest on the dimension of valence, on average (0.298). In stark contrast, music consumed in South and Middle America scores highest on this dimension; in particular, users in Colombia (0.486), Mexico (0.485), Argentina (0.482), and Brazil (0.478) tend to listen to a substantial amount of music that is suited to evoke positive emotions. Finally, when it comes to tempo, users in Iran and Turkey tend to prefer slower music, around 120 BPM on average. On the other hand, Venezuela, New Zealand, Hungary, and Germany prefer faster music, on average around 125 BPM.

Correlations Between Cultural Background and Music Preferences
To uncover possible relationships between acoustic properties of a country's inhabitants' music preferences and the cultural or socio-economic characteristics, we investigate the correlation between each of the acoustic features and the cultural/socio-economic dimensions. Tables 7 and 8 depict Spearman's rank-order correlation coefficients for Hofstede's cultural features and WHR socio-economic characteristics, respectively. We use rank-order correlation to cope with the different value ranges of the various dimensions investigated and compute these correlations considering all users in our dataset as observations. To describe each user's aggregated musical feature vector, we follow the same approach as detailed in Section 4.3. Correlations larger than 0.1 (or less than -0.1) are highlighted in bold. Statistically significant correlations are marked with an asterisk.
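As a minimal illustration of why a rank-order correlation is insensitive to the value ranges of the compared dimensions, the following self-contained Python sketch computes Spearman's coefficient as the Pearson correlation of average ranks. The helper names and the toy values are assumptions for illustration; in practice, a library routine such as `scipy.stats.spearmanr` would typically be used, which also reports a p-value.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values: an acoustic feature in [0, 1] and a cultural score on a 0-100 scale.
acousticness = [0.1, 0.4, 0.2, 0.9]
indulgence = [10, 40, 20, 90]
rho_same_order = spearman(acousticness, indulgence)
rho_reversed = spearman(acousticness, [90, 20, 40, 10])
```

Because only the ordering of the observations enters the computation, rescaling either variable by a monotonically increasing transformation leaves the coefficient unchanged.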
As a general observation, while almost all correlations are significant (even at p < 0.001), most are only weak, which hints at the different nature of aspects to compare. Nevertheless, some interesting observations can be made. Focusing on Table 7, we observe notable correlations for the cultural trait of indulgence (IND). More precisely, a positive correlation between IND and acousticness (0.125) as well as valence (0.114) is identified. This means that societies that like to engage in joyful activities tend to listen to music that has a higher probability of being acoustic and that evokes positive emotions, which makes sense. At the same time, indulging populations tend to prefer lower energy levels in music (correlation of -0.115), which hints at a preference for more relaxing music. Furthermore, uncertainty avoidance (UA) is positively correlated with music energy level (0.116), but negatively with acousticness (-0.122). Societies characterized by stiff codes and laws therefore tend to prefer more energetic music, but lower amounts of acoustic tracks. Also, there is a positive correlation between individualism (IDV) and acousticness (0.105).
Comparing the acoustic features with the WHR dimensions (cf. Table 8), we can only observe two correlations exceeding the threshold. Both relate to the aspect of generosity. More precisely, we see a positive correlation between generosity and acousticness (0.118) and a negative one between generosity and energy (-0.101). More generous populations therefore tend to prefer less energetic music with a more acoustic sound.

Conclusion and Future Work
The contributions of this work are twofold: (i) we introduced a novel music-cultural user model that jointly relies on acoustic song features and culture-related features to describe the user's musical preferences and cultural background and (ii) we proposed a recommender system that leverages these features as contextual information. Our evaluations based on a dataset comprising more than 55,000 users showed that the proposed user model is able to outperform models that incorporate either solely musical aspects or cultural aspects as well as the evaluated baseline methods (relying on the user's country as a proxy for culture, or utilizing solely the user's and track's identifiers). In regard to both recall and precision, we show that adding contextual information obtained via incorporating audio features of tracks, data extracted from the World Happiness Report, and Hofstede's cultural dimensions contributes to improved recommendations when compared to the baseline approaches. Particularly, we find that a combination of acoustic features of the songs a user listened to (describing the individual music preferences of a user) and the World Happiness Report as a description of the cultural/socio-economic background of the user performs best.
Future work includes extending the user models with further data utilized for capturing cultural aspects of users (e.g., the Quality of Government dataset (Dahlberg et al., 2016)). Moreover, we are particularly interested in analyzing the country-specific influence of each of the individual features of the proposed user models on the overall recommendation performance to get a deeper understanding of cohesive features that constitute listening patterns. Regarding the representation of both the musical preferences and cultural aspects, we plan to investigate more sophisticated modeling approaches. Particularly regarding the representation of musical preferences of users, we believe that, e.g., using Gaussian mixture models will allow for a more differentiated representation of users and their (possibly diverse and broad) preferences. Finally, we aim to transcend the country level for our culture-based analyses, e.g., focusing on culturally similar users that live in the same cultural region (but not necessarily in the same country).