Implementation

Project grounding, problems, and solutions with examples

WHY THE NEED FOR BIG DATA?

Our project has to work with a huge amount of data, and a simple application would not have been suitable for several reasons. First of all, the music data: we had a large number of tracks, genres (around 900), and artists for each of around 194 countries, and we needed to process all of it to obtain results that could be compared with data from a completely different field: the HDI index and the depression rate (one value of each for every nation in the world). It is clear that we could not have built such an application with simple sequential code and loops, since the response time would have been enormous. To solve this problem we used two services: Amazon Web Services, which provided the cluster on which we ran the code, and Apache Spark, a unified analytics engine for large-scale data processing.

HOW DOES IT WORK?

The project is based on the map-reduce model, with Python as the programming language. We developed several phases to reach the final result. In order to work with the datasets we had to do a lot of data cleaning.

First of all, the music data: since the API could not give us the genre of each song directly, we had to use the tags returned by track.getInfo. These tags can be many different things, for example the artist or song name, the year of publication, or anything else associated with the song. To find genres among these tags, we first cleaned them so that all of the data has the same structure (e.g. no capital letters or special characters such as "-"). Then we compared each tag of a song against a separate list of cleaned genres. The matching genres were then summed up per country.

The HDI and depression-rate datasets also had to be cleaned before usage. The depression-rate dataset gives values for the last 15 years of each country, so we only used the most recent value, as it is the most accurate.
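As a concrete illustration of the cleaning-and-matching step described above, here is a minimal sketch. The tags, the genre list, and the helper name `normalize` are made up for this example; the real project reads its genre list from a file and runs this logic over Spark RDDs.

```python
# Illustrative sketch of the tag-cleaning step: normalize tags, then keep
# only the ones that appear in the cleaned genre list. All data below is
# made up for demonstration purposes.
def normalize(tag):
    # same normalization the project applies: strip "-", lowercase
    return tag.replace("-", " ").lower()

genres = {"hip hop", "rock", "pop"}

# raw tags as they might come back from track.getInfo
raw_tags = ["Hip-Hop", "2001", "Rock", "seen live"]

matched = [normalize(t) for t in raw_tags if normalize(t) in genres]
print(matched)  # only the tags that are actual genres survive
```

Running this keeps `"hip hop"` and `"rock"` and discards the year and the "seen live" tag, which is exactly the filtering behaviour the project needs before counting genres per country.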

Workflow

Now we will explain how we moved through the code and the guidelines we followed:

Extract tags from last.fm API per country:


import requests

def getSongInfo(mbid):
    # URL truncated in the original source; the call returns the parsed
    # JSON response (parsing with .json() is our assumption, since the
    # result is used as a dict below)
    return requests.get("http://ws.audioscrobbler.com/[...]").json()
                
                
def tagsExtractor(track):
    mbid = track['mbid'].replace('"', '')
    ...
    if 'track' in response.keys():
        topTagsAndLinks = response['track']['toptags']['tag']
        tags = []
        for tagsAndLinks in topTagsAndLinks:
            tags.append(tagsAndLinks['name'])
            ...
        return tags
                  
              
def createGroups(country_attribute):
    country_attribute = country_attribute.sortBy(lambda x: x[1]).collect()
    number_of_groups = 15
    number_of_countries = int(len(country_attribute)/number_of_groups)
    country_attribute_groups = []
    ... 

    return country_attribute_groups
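The body of `createGroups` is elided above, so as a hedged sketch, here is one plausible way the sorted country list could be split into 15 equal-sized groups. The chunking logic and the sample data are our assumption, not the project's actual code.

```python
# Hedged sketch of the grouping step: split a sorted list of
# (country, value) pairs into equal-sized groups. The slicing logic is
# our assumption; the real createGroups body is not shown in the report.
def create_groups(country_attribute, number_of_groups=15):
    # country_attribute: list of (country, value) pairs, already sorted by value
    number_of_countries = int(len(country_attribute) / number_of_groups)
    groups = []
    for i in range(number_of_groups):
        start = i * number_of_countries
        groups.append(country_attribute[start:start + number_of_countries])
    return groups

# made-up sample: 30 countries -> 15 groups of 2
sample = [("C{}".format(i), i) for i in range(30)]
groups = create_groups(sample, number_of_groups=15)
print(len(groups), len(groups[0]))  # 15 2
```

With 194 countries and 15 groups, integer division leaves a remainder, so the real code presumably either drops the leftover countries or adds them to the last group; the sketch above simply truncates.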

Clean up the data and filter the tags:


def cleanup(tag_number):
    tag, number = tag_number
    newTag = tag.replace("-", " ").lower()
    return newTag, number


genres_clean = sc.textFile('genres_clean.txt')

for line in genres_clean.collect():
    genres_dict[line] = 0
...

for gt in depressionGroupsTags.collect():
    tag, number = gt
    if tag in genres_dict.keys():
        genres_dict[tag] = number
Count the genres per group of countries:


for key in genres_dict.keys():
    if genres_dict[key] != 0:
        groupOut.write("{}:{}\n".format(key, genres_dict[key]))
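The counting-and-writing step above can be sketched end to end in plain Python. The tag data below is made up, and an in-memory buffer stands in for the `groupOut` file handle; in the project the (tag, count) pairs come from the Spark pipeline.

```python
import io

# Self-contained sketch of the final step: fill genres_dict from
# (tag, count) pairs, then write one "genre:count" line per non-zero
# entry. The data here is illustrative only.
genres_dict = {"rock": 0, "pop": 0, "jazz": 0}

# (tag, count) pairs as they might arrive for one group of countries
pairs = [("rock", 12), ("pop", 7), ("2001", 3)]
for tag, number in pairs:
    if tag in genres_dict:
        genres_dict[tag] = number

out = io.StringIO()  # stands in for the groupOut file handle
for key in genres_dict:
    if genres_dict[key] != 0:
        out.write("{}:{}\n".format(key, genres_dict[key]))
print(out.getvalue())
```

Note that the non-genre tag `"2001"` never reaches the output, and `"jazz"` is skipped because its count stayed at zero.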