The aim of this project is to perform some experiments with graph embedding and link prediction, starting from a raw dataset that is not yet semantically meaningful. It was developed as the final examination for the Knowledge Representation and Extraction course of the Digital Humanities and Digital Knowledge Master's Degree at the University of Bologna, held by professors Aldo Gangemi and Andrea Nuzzolese. The goal is to show that it is possible, and fairly easy, to transform non-linked data into a linked form, and that this opens up many interesting options. DBpedia and Wikidata already offer a huge number of entities related to music, but their properties are often very limited for musical records that are not particularly popular. Using a much wider dataset, specialized in musical records, allows for more interesting and accurate predictions. In the field of musical records, this could lead to a system which not only automatically predicts the genres, descriptors and features of an album, but also the probability that it will be liked, taking into account the average scores given by the users.
In order to accomplish such a task, the first necessary step was to gather a raw dataset that could be formalized as a set of triples. My aim was to work with a musical dataset about records and artists, so I extracted raw data from the RateYourMusic.com website. RateYourMusic is a large musical database in which users can autonomously enter musical records and label them according to various properties, such as their artist, publication date, musical genres, a range of descriptors and so on. For the purposes of this project, around 22000 musical records were extracted, ranging from the most popular to lesser-known ones.
Then a simple and small ontology was developed using Protégé. The ontology, which serves as a conceptual model, links an album record with its main properties, namely the artist that made the record, its main genre, its secondary genre, some descriptors and so on. I decided to implement my own ontology in order to maintain maximum flexibility; however, it could be useful and interesting to exploit already existing ontologies that deal with musical entities. Moreover, since one of the goals was to align the dataset with Wikidata, the Artist class was subdivided into two subclasses, i.e. humans (songwriters, performers, musicians, etc.) and musical groups. These subclasses have properties as well, such as a birthplace for humans, a list of members for bands and so on. The list of properties for artists is completely arbitrary and can be extended at any time. You can download the Protégé file here.
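As an illustrative sketch of this structure (the class and property names below are hypothetical and only approximate the actual Protégé file), the same conceptual model can be expressed with rdflib:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

# Hypothetical namespace; the actual ontology uses its own IRIs.
MUS = Namespace("http://example.org/music#")

g = Graph()
g.bind("mus", MUS)

# Core classes: Album, Genre, Descriptor and Artist, with Human and MusicGroup as Artist subclasses.
for cls in (MUS.Album, MUS.Genre, MUS.Descriptor, MUS.Artist, MUS.Human, MUS.MusicGroup):
    g.add((cls, RDF.type, OWL.Class))
g.add((MUS.Human, RDFS.subClassOf, MUS.Artist))
g.add((MUS.MusicGroup, RDFS.subClassOf, MUS.Artist))

# Object properties linking an album record to its main features.
for prop, rng in [(MUS.hasArtist, MUS.Artist),
                  (MUS.hasMainGenre, MUS.Genre),
                  (MUS.hasSecondaryGenre, MUS.Genre),
                  (MUS.hasDescriptor, MUS.Descriptor)]:
    g.add((prop, RDF.type, OWL.ObjectProperty))
    g.add((prop, RDFS.domain, MUS.Album))
    g.add((prop, RDFS.range, rng))

print(g.serialize(format="turtle"))
```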
From a practical point of view, Python was used in order to exploit the vast amount of libraries available for handling large datasets. Pandas was used to take a first look at the data, clean it a bit, eliminate duplicates and get rid of some unnecessary properties. It was also important to align the entries with the already existing entries of a linked open dataset, such as DBpedia or Wikidata. Thus, a set of functions was developed to automatically extract information about a specific entity (using a SPARQL query) and add it to the main dataset. Once the dataset was aligned (in this case with Wikidata), it became possible to serialize it as subject-predicate-object triples. Using an RDF library, a set of triples was created that links every musical entity with its properties. Here you can see an example extracted from the graph using the Turtle syntax:
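The following is not the actual extract, but a minimal rdflib sketch, with purely hypothetical entity names and the same hypothetical property names as above, of how such triples can be built and serialized in Turtle:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF

# Hypothetical namespaces and identifiers, used only for illustration.
MUS = Namespace("http://example.org/music#")
WD = Namespace("http://www.wikidata.org/entity/")

g = Graph()
g.bind("mus", MUS)
g.bind("wd", WD)

album = MUS["example_album"]
artist = MUS["example_artist"]

g.add((album, RDF.type, MUS.Album))
g.add((album, MUS.hasArtist, artist))
g.add((album, MUS.hasMainGenre, MUS["alternative_rock"]))
g.add((album, MUS.hasSecondaryGenre, MUS["art_rock"]))
g.add((album, MUS.hasDescriptor, MUS["melancholic"]))
g.add((artist, RDF.type, MUS.MusicGroup))
g.add((artist, OWL.sameAs, WD["Q0"]))  # placeholder QID for the aligned Wikidata entity

print(g.serialize(format="turtle"))
```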
With the Turtle file, it became possible to experiment with graph embedding. For this task, two Python libraries were identified: Pykeen and RDF2Vec. However, after an initial phase of testing, RDF2Vec was discarded due to the difficulties of setting up a proper working environment. Moreover, Pykeen offers a built-in function for link prediction, making it much easier to set up for those without prior knowledge of machine learning. For the training phase the dataset was automatically split in two (with an 80/20 proportion) and fed to the pipeline function without changing any of the default settings (except for the epoch number, which was set to 1500). You can download the results of the training here.
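A minimal sketch of such a run with Pykeen's pipeline function, assuming TransE as the embedding model (the model actually used is not stated) and a tab-separated triples file exported from the graph:

```python
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# Load the serialized triples as a tab-separated file of (head, relation, tail) rows.
tf = TriplesFactory.from_path("triples.tsv")

# 80/20 split between training and testing, as in the original experiment.
training, testing = tf.split([0.8, 0.2])

result = pipeline(
    training=training,
    testing=testing,
    model="TransE",  # assumption: the actual model is not specified in the write-up
    training_kwargs=dict(num_epochs=1500),
)
result.save_to_directory("trained_model")
```

The call to save_to_directory stores the trained model together with the evaluation metrics computed on the test split.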
Finally, in order to get a more complete account of each record, a simple application of the nltk library was developed. This makes it possible to compute a sentiment score (i.e. a positive, a neutral and a negative one) for each album based on its descriptors. This score is meant to represent the "mood" of the record, not its value or quality. For instance, a record could be labelled as mainly negative because its descriptor list includes terms such as "depressive", "melancholic" and so on. As is often the case with the arts, negative feelings can be exploited to discuss important topics (e.g. depression) or to work through them.
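A minimal sketch of how such scores can be obtained with nltk's VADER analyzer, assuming the descriptors are simply joined into one string (the descriptor list below is hypothetical):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # required once for the VADER analyzer

sia = SentimentIntensityAnalyzer()

# Hypothetical list of descriptors attached to an album.
descriptors = ["melancholic", "depressive", "atmospheric", "ethereal"]

scores = sia.polarity_scores(" ".join(descriptors))
print(scores)  # dictionary with 'neg', 'neu', 'pos' and 'compound' scores
```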
All the data, including the code and the trained model, can be found in the GitHub repository.
The data was extracted from the charts page using a JavaScript script developed by Simone Del Deo, which converts the chart entries into a raw .csv file. For the purposes of this project, around 22000 album records were extracted, along with all their relevant properties.
In order to create a semantically meaningful dataset, some basic cleaning was required. Pandas was used to eliminate duplicates, exclude useless columns and link each entry with the related Wikidata entry. In order to have a reasonable amount of data to work with, a selection of properties was made; in this case, properties regarding the artists of the records, e.g. their main occupation, the instruments they play, etc.
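A hedged sketch of this cleaning and alignment step, assuming hypothetical column names and using SPARQLWrapper to query the Wikidata endpoint (the actual queries and the selected properties may differ):

```python
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

# Load the raw scrape, drop duplicates and columns that are not needed.
df = pd.read_csv("rym_albums.csv")
df = df.drop_duplicates(subset=["artist", "album"])        # hypothetical column names
df = df.drop(columns=["chart_position"], errors="ignore")  # hypothetical column

def wikidata_entity_for_artist(name: str):
    """Return the Wikidata IRI of an entity with the given English label, if any."""
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="music-linking-sketch/0.1")  # polite custom user agent
    sparql.setQuery(f"""
        SELECT ?item WHERE {{
          ?item rdfs:label "{name}"@en .
        }} LIMIT 1
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    bindings = results["results"]["bindings"]
    return bindings[0]["item"]["value"] if bindings else None

# Note: calling the endpoint once per row is slow for ~22000 records; this is only illustrative.
df["wikidata_entity"] = df["artist"].apply(wikidata_entity_for_artist)
```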
The final table was then transformed into a Turtle serialization, linking each entry with its properties in the form of subject-predicate-object triples. The final Turtle file contains more than 200K relations, linking records with their descriptors and the related artists with their Wikidata properties.
Once the linked dataset was built, it became possible to perform some interesting operations on it. Pykeen was used to create a graph embedding of the dataset. In order to exploit Pykeen's functionalities, the graph was converted into a tabular file, with each row representing a triple. This made it possible to perform many significant operations on the data, such as link prediction, i.e. predicting the probability that a link between two nodes occurs. A simple sentiment analysis based on the album descriptors was also implemented using nltk.
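Continuing from the training sketch above and reusing its result object, a hedged example of how recent versions of Pykeen expose link prediction, ranking candidate tails for a given head and relation (the labels below are hypothetical):

```python
from pykeen.predict import predict_target

# Score every candidate tail for a (head, relation) pair, e.g. the descriptors of one album.
predictions = predict_target(
    model=result.model,
    head="example_album",        # hypothetical entity label
    relation="hasDescriptor",    # hypothetical relation label
    triples_factory=result.training,
)

# Show the ten highest-scoring candidate tails.
print(predictions.df.sort_values("score", ascending=False).head(10))
```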
Bolded items are hits that were already present in the training phase; green items are hits predicted entirely by the algorithm.
The aim of the present project was to extract raw data and transform it into a linked form from which it is possible to extract knowledge. Thanks to graph embedding, this knowledge can be exploited to gather new insights through machine learning algorithms. The current dataset behaves well for most of the link prediction tasks, with variable results depending on the specific relationship. There are clear differences between the different kinds of predictions: for instance, the artist property was much harder to predict than the descriptors. However, as far as genres and descriptors are concerned, the system actually performs pretty well and is able to make meaningful predictions. It is also interesting to notice that even when the guesses are wrong, they are usually still close to the actual hit: the system predicts artists that share something with the real record's artist, be it origins, influences or musical genres; and a noisy, mellow shoegaze album was never labelled as "joyful" or with similar descriptors. As for the properties inherited from Wikidata, the system seems pretty reliable as well: it is always able to distinguish the kind of artist (i.e. whether it is a single human being or a band) and to predict some of their properties. Again, it struggles a bit more when it comes to predicting band members.
There is no doubt that the current system could be enhanced in many ways. First of all, only a small and arbitrary selection was made from the whole set of Wikidata properties. Exploiting all the individual properties could lead to improvements in the prediction tasks; moreover, it would be possible to link the band members with their own Wikidata entries and gather further knowledge about the main entity. To explore the full potential of link prediction, the whole graph is needed. Secondly, the average rating given by the users was not included here, because it would make little sense to predict an integer using graph embedding. However, having an average rating for all these album records could lead us to implement a separate machine learning algorithm to automatically predict a score (which of course should be weighted by the number of ratings). This could help in creating an accurate system for suggesting records to final users.
I hold a bachelor's and a master's degree in philosophy, both obtained at the University of Bologna. I recently graduated with a master's thesis in Philosophy of Science titled "What Medicine and What Disease? Health and Disease in the Era of Personalized Medicine". I am currently enrolled in the DHDK programme, where I am focusing on data extraction, knowledge management, natural language processing and all the nerdy code-related stuff.