Analyzing Six Deep Learning Tools for Music Generation

Posted on

As deep learning is gaining in popularity, creative applications are gaining traction as well. Looking at music generation through deep learning, new algorithms and songs are popping up on a weekly basis. In this post we will go over six major players in the field, and point out some difficult challenges these systems still face. GitHub links are provided for those who are interested in the technical details (or if you’re looking to generate some music of your own).


Magenta is Google’s open source deep learning music project. They aim to use machine learning to generate compelling music. The project went open source in June 2016 and currently implements a regular RNN and two LSTM’s.
Great, because: It can handle any monophonic midi file. The documentation is good, so it’s relatively easy to set-up. The team is actively improving the models and adding functionality. For every model Magenta has provided a training bundle that is trained on thousands of midi files. You can start generating new midi files right away using these pre-trained models.
Challenges: At this point, Magenta can only generate a single stream of notes. Efforts have been made to combine the generated melodies with drums and guitars – but based on human input, as of yet. Once a model that can process polyphonic music has been trained, it could start to create harmonies (or at least multiple streams of notes). This would indeed be a mighty step on their quest for the generation of some compelling music.
Sounds like: The piece below is generated by Magenta from the 8th note onward. Here they use their attention model with the provided pre-trained bundle.


The result of a thirty-six-hour hackathon by Ji-Sung Kim. It uses a two layer LSTM that learns from a midi file as its input source. DeepJazz has received quite some news coverage in the first six months of its existence.
Great, because: Can create some jazz by being trained on a single midi file. The project itself is also compelling proof that creating a working computational music prototype using deep learning techniques can be a matter of hours thanks to libraries like Keras, Theano & Tensorflow.
Challenges: While it can handle chords, it converts the jazz midi to a single pitch and single instrument. It would take a few more post-processing steps for the deep learning created melodies to sound more like human created jazz music.
Sounds like: The following piece is generated after 128 epochs (i.e. the training set consisting of a single midi file has cycled through the model that many times).


A research project by Feynman Liang at Cambridge University,  also using an LSTM. This time it is used to train itself on Bach chorales. It’s goal is to generate and harmonize chorales indistinguishable from Bach’s own work. The website offers a test where one can listen to two streams and guess which one is an actual composition by Bach.
Great, because: Research found that people have a hard time distinguishing generated Bach from the real stuff. Also, this is one of the best efforts in handling polyphonic music as the algorithm can handle up to four voices.
Challenges: BachBot works best if one or more of the voices are fixed. Otherwise the algorithm just generates wandering chorales.The algorithm could be used to add chorales to a generated melody.
Sounds like: In the below example the notes for “Twinkle Twinkle Little Star” were fixed, with the chorales generated.


In the picturesque city of Paris, a research team is working on a system that can help to keep an artist in a creative flow. Their system can generate leadsheets based on the style of a composer in a database filled with about 13000 sheets. Markov constraints are used here as neural network technique.
GitHub: not open source.
Great, because: The system has composed the first AI pop-songs.
Challenges: Producing pop songs from a generated leadsheet to these pop songs is not simply done at the click of a button – it still requires a well-skilled musician to create a compelling song like in the example below. Reducing the difficulty of these steps with the help of deep learning is still an open challenge.
Sounds like: The song is composed by the FlowMachines AI. In order to do so, the musician chose the “Beatles” style, and generated melody and harmony. Note the rest of the score (production, mixing, and assigning audio pieces to the notes) was produces by human composer.


Researchers at Google’s DeepMind have created Wavenet. Wavenet is based on Convolutional Neural Networks, the deep learning technique that works very well in image classification and generation in the past few years. Their most promising purpose is to enhance text-to-speech applications by generating a more natural flow in vocal sound. However, their method can also be applied to music as both the input and output consists of raw audio.
GitHub: WaveNet’s code is not open source, but others have implemented it based on DeepMind’s documentation. For example:
Great, because: It uses raw audio as input. Therefore it can generate any kind of instrument, and even any kind of sound. It will be interesting to see what this technique is capable of once trained on hours of music.
Challenges: The algorithm is computationally expensive. It takes minutes to train on a second of sound. Some have started to create a faster version. Another researcher working for Google, Sageev Oore from the Magenta project, has written a blog post where he describes what can be learned from the musical output of Wavenet. One of his conclusions is that the algorithm can produce piano notes without a beginning, making them unplayable on a real piano. Interestingly, Wavenet can extend the current library of sounds that a piano can create and produce a new form of piano music – perhaps the next step in (generated) music.
Sounds like: Trained on a dataset of piano music results in the following ten seconds of sound:


A Stanford research project that, similar to Wavenet, also tries to use audio waveforms as input, but with an LSTM’s and GRU’s rather than CNN’s. They have showed their proof of concept to the world in June 2015.
Great, because: The Stanford researchers were one of the first to show how to generate sounds with an LSTM using raw waveforms as input.
Challenges: The demonstration they provide seems over-fitted on a particular song, due to the small training corpus and the sheer amount of layers of the NN. The researchers themselves did not have the time nor computational power to experiment further with this. Fortunately, this void is starting to get filled by researchers from WaveNet and other enthusiasts. Jakub Fiala has used this code to generate an interesting amen drum break, see this blog post.
Sounds like: The tool trained on a variety of Madeon songs, resulted in the below sample. Until 1:10 is an excerpt of the creation after 100 up to 1000 iterations, after that is a mash-up of their best generated pieces. This excerpt is a recording of this video.

Notes VS Waves

The described deep learning music applications can be divided into two categories based on the input method. Magenta, DeepJazz, BachBot, and FlowMachines all use input in the form of note sequences, while GRUV and Wavenet use raw audio.

Input type: Note sequences Raw audio
Computational complexity Low (minutes – few hours) High (few hours – days)
Editable result Yes, can be imported in music production software No, waveform itself has to be edited
Musical complexity As complex as a single song from the corpus As complex as the combination of the entire corpus

Can we call out a clear winner? In my opinion: no. Each has different applications and these methods can coexist until generating compelling music with raw audio becomes so fast that there is simply no point in doing it yourself.

Music will be easier to create by people who are assisted by an AI that can suggest a melody or harmony. However, these people still need to be musicians (for now). The moment it is possible to train a deep learning algorithm on your entire Spotify history in raw audio form, and generate new songs, everyone can be a musician.

Image classification and generation has been improved with neural network techniques, reaching higher benchmark scores than ever before, mostly thanks to the speed at which huge sets of pixels can be trained. For audio the overarching question is: when will raw audio overtake notes as the pixel of music?

Did you miss anything, or do you have any other feedback? Comments are greatly appreciated. At the Asimov Institute we do deep learning research and development, so be sure to follow us on Twitter for future updates and posts!  In this post we did no go into the technical details, but if you’re new to deep learning or unfamiliar with a method, I refer you to one of our previous posts on neural networks.

We are currently working on generating electronic dance music using deep learning. If you want to share your ideas on this, or have some interesting data to show, please send a message to Thank you for reading!