A Tutorial Series for Software Developers, Data Scientists, and Data Center Managers
At this point in the tutorial, all the relevant datasets have been found, collected, and preprocessed. For more information about these steps, please check out the earlier articles in this series. The BachBot*1 model was used to harmonize the melody. This article describes the processes of defining, training, testing, and modifying BachBot.
Defining a Model
In the previous article (Deep Learning for Music Generation 1-Choosing a Model and Data Preprocessing), it was explained that the problem of automatic composition can be reduced to a problem of sequence prediction. In particular, the model should predict the most probable next note, given the previous notes. This type of problem is well suited to a long short-term memory (LSTM) neural network. Formally, the model should predict P(x_{t+1} | x_t, h_{t-1}), a probability distribution over the possible next notes x_{t+1}, given the current token x_t and the previous hidden state h_{t-1}. Interestingly, this is exactly the operation performed by recurrent neural network (RNN) language models.
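To make this concrete, here is a minimal sketch of such a sequence model in PyTorch. BachBot itself is implemented in Torch7/Lua, so the class name, default values, and layer arrangement below are illustrative assumptions, not code from the repository:

```python
import torch
import torch.nn as nn

class NoteLSTM(nn.Module):
    """Illustrative LSTM language model over a vocabulary of note tokens."""
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=256,
                 num_layers=3, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            dropout=dropout, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        # tokens: (batch, time) integer token indices
        emb = self.embed(tokens)
        out, hidden = self.lstm(emb, hidden)   # hidden carries h_{t-1}, c_{t-1}
        logits = self.proj(out)                # unnormalized scores for x_{t+1}
        return logits, hidden
```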
In composition, the model is initialized with the START token (see the previous article for more about the encoding scheme) and then picks the most likely token to follow it. After that, it keeps picking the most probable next token, using the current note and the previous hidden state, until it generates the END token. A temperature control introduces a degree of randomness to prevent BachBot from composing the same piece over and over again.
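A minimal sketch of this composition loop, assuming the illustrative NoteLSTM above and hypothetical start_id/end_id token indices from the encoding:

```python
import torch
import torch.nn.functional as F

def compose(model, start_id, end_id, temperature=1.0, max_len=1000):
    """Sample tokens one at a time until the END token appears."""
    model.eval()
    tokens, hidden = [start_id], None
    x = torch.tensor([[start_id]])                 # shape (batch=1, time=1)
    with torch.no_grad():
        for _ in range(max_len):
            logits, hidden = model(x, hidden)      # reuse the previous hidden state
            # Dividing by the temperature flattens (T > 1) or sharpens (T < 1)
            # the distribution before sampling, adding controlled randomness.
            probs = F.softmax(logits[0, -1] / temperature, dim=-1)
            next_id = torch.multinomial(probs, 1).item()
            tokens.append(next_id)
            if next_id == end_id:
                break
            x = torch.tensor([[next_id]])
    return tokens
```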
Loss
In training a prediction model, there is typically a function to be minimized, called the loss, that describes the difference between the model's predictions and the ground truth. BachBot minimizes the cross entropy loss between the predicted distribution over x_{t+1} and the actual target distribution. Cross entropy loss is a good starting point for a wide range of tasks, but in some cases you may have your own loss function. Another valid approach is to try several loss functions and keep the model that achieves the lowest loss on the validation set.
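As a rough sketch of how this loss is computed in PyTorch (the batch, sequence, and vocabulary sizes below are toy placeholders, not BachBot's actual values):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Toy shapes for illustration: 8 sequences, 128 time steps, 100-token vocabulary.
logits = torch.randn(8, 128, 100)          # model outputs (unnormalized scores)
targets = torch.randint(0, 100, (8, 128))  # ground-truth next tokens

# Flatten batch and time so each position is scored against its true next token.
loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```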
Training/Testing
When training the RNN, BachBot used the correct token as x_{t+1}, instead of the model's own prediction. This process, known as teacher forcing, is used to aid convergence, because the model's predictions will naturally be poor at the beginning of training. In contrast, during validation and composition, the model's prediction of x_{t+1} is fed back as the input for the next prediction.
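A minimal sketch of one teacher-forced update step, again in PyTorch rather than the original Torch7 code, and assuming the NoteLSTM and criterion sketched above:

```python
def train_step(model, criterion, optimizer, batch):
    """One teacher-forced update.
    batch: (batch, time) ground-truth token ids.
    Inputs are x_1..x_{T-1}; targets are x_2..x_T, i.e. the true next tokens
    are fed in as inputs instead of the model's own predictions."""
    model.train()
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits, _ = model(inputs)
    loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```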
Other Considerations
Practical techniques that were used in this model to improve performance, and that are common in LSTM networks, are gradient norm clipping, dropout, batch normalization, and truncated backpropagation through time (BPTT).
Gradient norm clipping mitigates the problem of the exploding gradient (the counterpart to the vanishing gradient problem, which was solved by using an LSTM memory cell architecture). When gradient norm clipping is used, gradients that exceed a certain threshold are clipped or scaled.
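A sketch of how this can be done in PyTorch; the threshold value here is an illustrative placeholder, not the one used by BachBot:

```python
import torch

def clipped_update(model, loss, optimizer, max_norm=5.0):
    """Backpropagate, then rescale gradients whose global norm exceeds max_norm."""
    optimizer.zero_grad()
    loss.backward()
    # Rescale all gradients in place so their combined norm is at most max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```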
Dropout is a technique that causes certain neurons to randomly turn off (dropout) during training. This prevents overfitting and improves generalization. Overfitting is a problem that occurs when the model becomes optimized for the training dataset, and is less applicable to samples outside of the training dataset. Dropout often worsens training loss, but improves validation loss (more on this later).
Computing the gradient of an RNN on a sequence of length 1000 costs the equivalent of a forward and backward pass on a 1000-layer feedforward network. Truncated BPTT is used to reduce the cost of updating parameters in the training process: errors are only propagated a fixed number of time steps backward. Note that learning long-term dependencies is still possible when using truncated BPTT, because the hidden states have already been exposed to many previous time steps.
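The sketch below shows one common way to implement truncated BPTT in PyTorch (an assumption about the mechanics, not BachBot's train.lua): walk a long sequence in fixed-size chunks, carry the hidden state forward, but detach it so gradients stop at each chunk boundary.

```python
def train_truncated_bptt(model, criterion, optimizer, sequence, seq_len=128):
    """sequence: (batch, total_time) token ids for one long piece."""
    hidden = None
    for start in range(0, sequence.size(1) - 1, seq_len):
        inputs = sequence[:, start:start + seq_len]
        targets = sequence[:, start + 1:start + seq_len + 1]
        inputs = inputs[:, :targets.size(1)]     # keep lengths aligned at the end
        if hidden is not None:
            # Detach (h, c) so backprop does not flow past the chunk boundary,
            # while the values themselves still summarize the earlier context.
            hidden = tuple(h.detach() for h in hidden)
        logits, hidden = model(inputs, hidden)
        loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```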
Parameters
The parameters that are relevant in RNN/LSTM models are:
- The number of layers. As this increases, the model may become more powerful but slower to train. Also, having too many layers may result in overfitting.
- The hidden state dimension. Increasing this may improve model capacity, but can cause overfitting.
- The dimension of the vector embeddings.
- Sequence length/number of frames before truncating BPTT.
- Dropout probability. The probability that a neuron drops out at each update cycle.
Finding the optimal set of parameters will be discussed later in the article.
Implementation, Training and Testing
Choosing a Framework
Nowadays, there are many frameworks that help to implement machine learning models in a variety of languages (even JavaScript*!). Some popular frameworks are scikit-learn*, TensorFlow*, and Torch*.
Torch3 was selected as the framework for the BachBot project. TensorFlow was tried first; however, at the time it used unrolled RNNs, which overflowed the graphics processing unit's (GPU's) RAM. Torch is a scientific computing framework that runs on the speedy LuaJIT* language and has strong neural network and optimization libraries.
Implementing and Training the Model
Implementation will naturally vary depending on the language and framework you choose. To see how LSTMs were implemented using Torch in BachBot, check out the scripts used to define and train the model, available on Feynman Liang's GitHub* site 2.
A good starting place in navigating the repository is 1-train.zsh. From there you should be able to find your way to bachbot.py.
Specifically, the essential script that defines the model is LSTM.lua. The script that trains the model is train.lua.
Hyperparameter Optimization
To find the best hyperparameter settings, a grid search was used on the following grid.
| Parameter | Values searched |
| --- | --- |
| Number of layers | 1, 2, 3, 4 |
| Hidden state dimension | 128, 256, 384, 512 |
| Dimension of vector embeddings | 16, 32, 64 |
| Sequence length | 64, 128, 256 |
| Dropout probability | 0.0, 0.1, 0.2, 0.3, 0.4, 0.5 |
Figure 1: Parameter grid used in BachBot* grid search 1.
A grid search is an exhaustive search over all possible combinations of parameters. Other common approaches to hyperparameter optimization are random search and Bayesian optimization.
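A minimal sketch of an exhaustive grid search over the values in Figure 1. The evaluate argument stands in for a full training run that returns the validation loss; it is a hypothetical helper supplied by the caller, not part of BachBot's scripts:

```python
import itertools

def grid_search(evaluate):
    """Try every combination in the Figure 1 grid and keep the best one.
    evaluate(**params) is assumed to train a model and return its validation loss."""
    grid = {
        "num_layers": [1, 2, 3, 4],
        "hidden_dim": [128, 256, 384, 512],
        "embed_dim": [16, 32, 64],
        "seq_len": [64, 128, 256],
        "dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    }
    best_loss, best_params = float("inf"), None
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        loss = evaluate(**params)
        if loss < best_loss:
            best_loss, best_params = loss, params
    return best_params, best_loss
```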
The optimal hyperparameter set found by the grid search was: number of layers = 3, hidden state dimension = 256, dimension of vector embeddings = 32, sequence length = 128, and dropout = 0.3.
This model achieved 0.324 cross entropy loss in training, and 0.477 cross entropy loss in validation. Plotting the training curve shows that training converges after 30 iterations (≈28.5 minutes on a single GPU) 1.
Plotting training and validation losses can also illustrate the effect of each hyperparameter. Of particular interest is dropout probability:
Figure 2: Training curves for various dropout settings 1.
Figure 2 shows that dropout indeed prevents overfitting: although dropout = 0.0 gives the lowest training loss, it gives the highest validation loss, whereas higher dropout probabilities lead to higher training losses but lower validation losses. In BachBot's case, the lowest validation loss occurred at a dropout probability of 0.3.
Alternate Evaluation (optional)
For some models, especially for creative applications such as music composition, loss may not be the most appropriate measure of success. Instead, a better measure could be subjective human evaluation.
The goal of the BachBot project was to automatically compose music that is indistinguishable from Bach’s own compositions. To evaluate this, an online survey was conducted. The survey was framed as a challenge to see whether the user could distinguish between BachBot’s and Bach’s compositions.
The results showed that people who took the challenge (759 participants, varying skill levels) could only accurately discriminate between the two samples 59 percent of the time. This is only 9 percent above random guessing! Take The BachBot Challenge yourself!
Adapting the Model to Harmonization
BachBot can now compute P(x_{t+1} | x_t, h_{t-1}), the probability distribution of the possible next notes given the current note and the previous hidden state. This sequential prediction model can then be adapted into one that harmonizes a melody. This adapted harmonization model is required for harmonizing the emotion-modulated melody for the slideshow music project.
In harmonization, a predefined melody is provided (typically the soprano line), and the model must compose the music for the other parts. A greedy best-first search, under the constraint that the melody notes are fixed, is used for this task. Greedy algorithms make choices that are locally optimal, so the simple strategy used for harmonization is as follows:
Let x_t be the tokens in the proposed harmonization. At time step t, if a melody note is given, x_t equals that given note; otherwise, x_t is the most likely next note as predicted by the model. The code for this adaptation can be found on Feynman Liang's GitHub: HarmModel.lua, harmonize.lua.
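The Python sketch below illustrates the same greedy strategy; it is not a translation of HarmModel.lua or harmonize.lua, and the way the melody constraints are represented (a list with fixed token ids or None per position) is an assumption made for illustration:

```python
import torch

def harmonize_greedy(model, constraints, start_id):
    """constraints: list where position t holds a fixed melody token id,
    or None if the model is free to choose. Returns the harmonized sequence."""
    model.eval()
    tokens, hidden = [start_id], None
    x = torch.tensor([[start_id]])
    with torch.no_grad():
        for fixed in constraints:
            logits, hidden = model(x, hidden)
            if fixed is not None:
                next_id = fixed                          # melody note is given: keep it
            else:
                next_id = logits[0, -1].argmax().item()  # greedy, locally optimal choice
            tokens.append(next_id)
            x = torch.tensor([[next_id]])
    return tokens
```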
Below is an example of BachBot’s harmonization of Twinkle, Twinkle, Little Star, using the above strategy.
Figure 3: The BachBot* harmonization of Twinkle, Twinkle, Little Star (in the soprano line). Alto, tenor and bass parts were filled in by BachBot 1.
In this example, the melody to Twinkle, Twinkle, Little Star is provided in the soprano line. The alto, tenor and bass parts are then filled by BachBot using the harmonization strategy. This is what that sounds like.
Despite BachBot's decent performance on this task, the model has certain limitations. Specifically, it does not look ahead in the melody; it uses only the current melody note and past context to generate notes. When people harmonize melodies, they can examine the whole melody, which makes it easier to infer appropriate harmonizations. Because this model cannot do that, future constraints can come as surprises that lead to mistakes. A beam search can help address this.
A beam search explores multiple trajectories. For example, instead of keeping only the most probable note (as is currently done), it may keep the four or five most probable notes and explore each of them. Exploring multiple options helps the model recover from mistakes. Beam searches are commonly used in natural language processing applications to generate sentences.
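For illustration, here is a minimal beam search sketch over the same kind of sequence model. It is not part of BachBot; the beam width and the simple log-probability scoring are assumptions for the sketch:

```python
import torch
import torch.nn.functional as F

def beam_search(model, start_id, end_id, beam_width=5, max_len=500):
    """Keep the beam_width most probable partial sequences instead of only one."""
    model.eval()
    beams = [([start_id], 0.0, None)]        # (tokens, log-probability, hidden state)
    with torch.no_grad():
        for _ in range(max_len):
            candidates = []
            for tokens, score, hidden in beams:
                if tokens[-1] == end_id:
                    candidates.append((tokens, score, hidden))  # already finished
                    continue
                x = torch.tensor([[tokens[-1]]])
                logits, new_hidden = model(x, hidden)
                log_probs = F.log_softmax(logits[0, -1], dim=-1)
                top_lp, top_id = log_probs.topk(beam_width)
                for lp, idx in zip(top_lp.tolist(), top_id.tolist()):
                    candidates.append((tokens + [idx], score + lp, new_hidden))
            # Keep only the beam_width highest-scoring candidates.
            beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
            if all(b[0][-1] == end_id for b in beams):
                break
    return beams[0][0]                       # highest-scoring completed sequence
```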
Emotion-modulated melodies can now be put through this harmonization model to be completed. The way this is done is detailed in the final article describing application deployment.
Conclusion
This article used BachBot as a case study to discuss the considerations involved in building a creative deep learning model. Specifically, it covered techniques that improve generalization and accelerate training for RNN/LSTM models, hyperparameter optimization, evaluation of the model, and ways to adapt a sequence prediction model for completion (or generation).
All of the parts of the Slideshow Music project are now complete. The final articles in this series will discuss how these parts are put together to form the final product. That is, they will discuss how the emotion-modulated melodies may be provided as an input to BachBot’s harmonization model, and deployment of the completed application.
References and Links
1. Liang, F., Gotham, M., Johnson, M., & Shotton, J. (2017). Automatic stylistic composition of Bach chorales with deep LSTM. 18th International Society for Music Information Retrieval Conference, Suzhou, China.
2. Liang, F. (2016). BachBot [Computer program]. Available at https://github.com/feynmanliang/bachbot/blob/master/scripts/
3. Collobert, R., Farabet, C., Kavukcuoglu, K., & Chintala, S. (2017). Torch (Version 7). Retrieved from http://torch.ch/.
Find more helpful resources at the Intel® Nervana™ AI Academy.