Week 2 — Tune It Up

Adnan Fidan
BBM406 Spring 2021 Projects
7 min read · Apr 18, 2021


Hello world,
We are Fidan Samet, Oğuz Bakır, and Adnan Fidan. In the scope of the Fundamentals of Machine Learning course project, we are working on the prediction and style transfer of song release years. We will be writing blogs about our progress throughout the project, and this is the second post in our blog series. In this post, we cover the methods and approaches used in the works related to our tasks. So let’s get started!

Previously on Tune It Up…

Timeline of Tune It Up

Last week, we talked about the problem we consider and the datasets we may use for our tasks. You can find last week’s blog here. This week, we will examine the approaches and methods used in the works related to our tasks.

Related Works

A. Song Year Classification

1. Song Year Prediction Using Apache Spark¹

  • Dataset: Million Song Dataset
  • Feature: Timbre
  • Evaluation Metric: Classification accuracy, Root-Mean-Square Error (RMSE)²

In this work, the authors use gradient boosted trees (GBTs), linear regression (LR), and random forest regression (RFR) to predict the release years of songs. After rescaling and normalizing the data, they apply two techniques to perform the task.

In the first technique, they train these models on all of the data, without any data segregation. Below is the framework for this technique. They obtain the lowest RMSE value of 9.66 with RFR.

Model Framework without Data Segregation
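
To make the setup concrete, below is a minimal sketch of this non-segregated technique using scikit-learn rather than Apache Spark MLlib, which the paper actually uses. The file name, the UCI YearPredictionMSD layout (release year in the first column, 90 timbre features after it), the train/test split, and the hyperparameters are our assumptions, not the paper’s exact configuration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical path; we assume the UCI "YearPredictionMSD" layout:
# the release year in column 0 and 90 timbre features after it.
data = pd.read_csv("YearPredictionMSD.txt", header=None).values
y, X = data[:, 0], data[:, 1:]

# Commonly used split for this benchmark: first 463,715 songs train, rest test.
X_train, X_test = X[:463715], X[463715:]
y_train, y_test = y[:463715], y[463715:]

# Rescale / normalize the timbre features, as the paper describes.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "LR": LinearRegression(),
    "RFR": RandomForestRegressor(n_estimators=100, n_jobs=-1),
    "GBT": GradientBoostingRegressor(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{name}: RMSE = {rmse:.2f}")
```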

In the second technique, they perform data segregation: they divide the data into two classes, songs released before the 90s and songs released after, and classify these two classes with a logistic regression classifier. Below is the framework for this technique. They obtain the lowest RMSE value of 13.68 with RFR and a classification accuracy of 71.6%.

Model Framework with Data Segregation
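
A rough sketch of the data-segregation technique, again with scikit-learn: a logistic regression classifier routes each song to a “before the 90s” or “90s and later” segment, and a separate random forest regressor predicts the year within each segment. The 1990 threshold, the per-segment regressors, and all hyperparameters are our reading of the pipeline, not the paper’s exact setup.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_squared_error

data = pd.read_csv("YearPredictionMSD.txt", header=None).values  # hypothetical path
y, X = data[:, 0], data[:, 1:]
X_train, X_test = X[:463715], X[463715:]
y_train, y_test = y[:463715], y[463715:]

# Step 1: classify each song as before the 90s (0) or the 90s and later (1).
seg_train = (y_train >= 1990).astype(int)
seg_test = (y_test >= 1990).astype(int)
clf = LogisticRegression(max_iter=1000).fit(X_train, seg_train)
seg_pred = clf.predict(X_test)
print("segment accuracy:", accuracy_score(seg_test, seg_pred))

# Step 2: one regressor per segment; test songs are routed by the classifier.
regressors = {
    s: RandomForestRegressor(n_estimators=100, n_jobs=-1).fit(
        X_train[seg_train == s], y_train[seg_train == s]
    )
    for s in (0, 1)
}
pred_years = np.empty(len(y_test))
for s in (0, 1):
    pred_years[seg_pred == s] = regressors[s].predict(X_test[seg_pred == s])
print("RMSE with segregation:", round(np.sqrt(mean_squared_error(y_test, pred_years)), 2))
```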

2. Release Year Prediction for Songs³

  • Dataset: Million Song Dataset
  • Feature: Timbre
  • Evaluation Metric: Mean Absolute Error (MAE)⁴

Histogram of Song Release Years in the Million Song Dataset

In this work, the authors first normalize the data. Then, they predict the song release years from the timbre features using a baseline, linear regression (LR), and polynomial regression (PR). The baseline is a naive approach that always outputs the most frequent year in the data as the prediction. They obtain the lowest MAE value of 6.64 with the PR model; the closest MAE value, 6.85, is obtained with the LR (normal equation) model.
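
Below is a minimal sketch of these three models (baseline, LR, PR) with scikit-learn, evaluated with MAE. We again assume the UCI YearPredictionMSD file layout and split, and the polynomial degree of 2 is our choice rather than the authors’ setting.

```python
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_absolute_error

data = pd.read_csv("YearPredictionMSD.txt", header=None).values  # hypothetical path
y, X = data[:, 0], data[:, 1:]
X_train, X_test = X[:463715], X[463715:]
y_train, y_test = y[:463715], y[463715:]

# Baseline: always predict the most frequent release year in the training set.
most_frequent_year = Counter(y_train).most_common(1)[0][0]
baseline_pred = np.full_like(y_test, most_frequent_year)
print("baseline MAE:", mean_absolute_error(y_test, baseline_pred))

# Linear regression (ordinary least squares).
lr = LinearRegression().fit(X_train, y_train)
print("LR MAE:", mean_absolute_error(y_test, lr.predict(X_test)))

# Polynomial regression: polynomial feature expansion followed by linear regression.
pr = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
pr.fit(X_train, y_train)
print("PR MAE:", mean_absolute_error(y_test, pr.predict(X_test)))
```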

3. The Million Song Dataset⁵

  • Dataset: Million Song Dataset
  • Feature: Audio features
  • Evaluation Metric: Average absolute difference (AAD) and the square root of the average squared difference (SRASD)

In the dataset paper, the authors perform song year prediction as a case study. For this purpose, they use k-Nearest Neighbors (k-NN) and Vowpal Wabbit (VW)⁶. They obtain the lowest AAD and SRASD scores with VW.
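
As a small illustration of this case study, the sketch below trains a k-NN regressor and computes the two reported metrics, AAD and SRASD (effectively MAE and RMSE under other names). The value k=50 is our guess, and Vowpal Wabbit, being a separate command-line tool, is not sketched here.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

def aad(y_true, y_pred):
    # Average absolute difference between true and predicted years.
    return float(np.mean(np.abs(y_true - y_pred)))

def srasd(y_true, y_pred):
    # Square root of the average squared difference.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

data = pd.read_csv("YearPredictionMSD.txt", header=None).values  # hypothetical path
y, X = data[:, 0], data[:, 1:]
X_train, X_test = X[:463715], X[463715:]
y_train, y_test = y[:463715], y[463715:]

knn = KNeighborsRegressor(n_neighbors=50).fit(X_train, y_train)
pred = knn.predict(X_test)
print("k-NN  AAD:", aad(y_test, pred), " SRASD:", srasd(y_test, pred))
```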

B. Song Style Transfer

1. Audio Style Transfer⁷

  • Feature: Audio signals

In this work, the authors adapt the neural network framework of the image style transfer study⁸ to perform song style transfer. Below is the framework proposed in this work.

Proposed Framework

First of all, they pre-process the content and style waveforms of the songs and extract texture statistics from these signal representations. Then, they iteratively modify the content sound so that its audio texture matches that of the style sound. They use neural-network-based and auditory-based approaches as the sound texture model.

In the neural-network-based approach, the texture model is data-driven, which makes it powerful but makes the results harder to analyze. They experiment with 2D spectrograms fed to a VGG-19 network, raw waveforms fed to SoundNet⁹, and 2D spectrograms fed to a one-layer CNN model as a wide, shallow, random network.
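
The sketch below shows one possible reading of the wide-shallow-random texture model: a single random, untrained 2D convolution over a spectrogram, with the Gram matrix of its activations serving as the texture statistic, in the spirit of the image style transfer loss. The shapes, kernel size, and channel count are our guesses, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

def gram_matrix(features):
    # features: (channels, freq, time) -> channel-by-channel correlations
    c = features.shape[0]
    f = features.reshape(c, -1)
    return f @ f.t() / f.shape[1]

# One random, untrained convolution acts as the "wide-shallow" feature extractor.
conv = nn.Conv2d(in_channels=1, out_channels=512, kernel_size=11, padding=5, bias=False)
for p in conv.parameters():
    p.requires_grad_(False)

def texture_loss(content_spec, style_spec):
    # Spectrograms as (1, 1, freq_bins, time_frames) tensors.
    g_content = gram_matrix(conv(content_spec).squeeze(0))
    g_style = gram_matrix(conv(style_spec).squeeze(0))
    return ((g_content - g_style) ** 2).mean()

# In the iterative scheme, the content spectrogram itself is the variable being
# optimized: mark it with requires_grad and minimize texture_loss (e.g. with
# L-BFGS) until its texture matches the style's.
content = torch.randn(1, 1, 513, 400, requires_grad=True)  # stand-in spectrograms
style = torch.randn(1, 1, 513, 400)
print(texture_loss(content, style).item())
```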

In the auditory-based approach, a statistical model prepared by auditory perception experts¹⁰ is used. The figure below shows the model results. The first row shows the content sound, the second row shows the style sound, and the remaining rows show the output of the texture models in spectrogram form.

Results of Proposed Models

There is noise in the results of VGG-19 and SoundNet. The shallow network manages to add the local texture of the style sound onto the global structure of the content sound. The results of the statistical model are also promising, with successful recreation of the local texture. Note that this work does not use a dataset or quantitative evaluation metrics; it only tests the models on example sounds.

2. TimbreTron: A Wavenet Pipeline for Musical Timbre Transfer¹¹

  • Dataset: The authors create specific music data by using piano, flute, violin, and harpsichord instruments.
  • Feature: Audio spectrogram
  • Evaluation Metric: With Amazon Mechanical Turk¹², they ask the questions given in the table below and obtain success percentages from the answers.

Question Table for Evaluation Metrics

The authors use the Short-Time Fourier Transform (STFT)¹³ to convert the audio files in the data into spectrograms. After transferring the style of a spectrogram between four different classes (piano, flute, violin, and harpsichord) with the CycleGAN¹⁴ model, they reconstruct audio from the spectrograms using the Griffin-Lim¹⁵ algorithm. In this first attempt, the pitch is not transferred correctly and the sound quality is very low.
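
For a concrete picture of this first pipeline’s front and back ends, here is a small librosa sketch that computes an STFT magnitude spectrogram and reconstructs audio from it with Griffin-Lim. The CycleGAN timbre-transfer step in the middle is omitted, and the file names, sample rate, and STFT parameters are our assumptions.

```python
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("piano_input.wav", sr=16000)            # hypothetical file
mag = np.abs(librosa.stft(y, n_fft=2048, hop_length=256))    # magnitude spectrogram

# ... CycleGAN would transform `mag` from the piano domain to, e.g., violin ...

# Griffin-Lim estimates a phase consistent with the (modified) magnitude.
y_rec = librosa.griffinlim(mag, n_iter=60, hop_length=256)
sf.write("violin_output.wav", y_rec, sr)                     # hypothetical output
```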

In the second experiment, they use the Constant-Q Transform (CQT)¹⁶ to generate the spectrograms. This time, they use a WaveNet synthesizer¹⁷ to turn the outputs into sound, keeping the CycleGAN model unchanged. The reason they use WaveNet is that spectrograms produced with CQT cannot be directly inverted back into audio. This model captures the pitches accurately and provides a high-quality style transfer. Below is the high-level representation of this model.

High-Level Representation of Model
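
The CQT front end of this second pipeline can be sketched with librosa as below. The WaveNet synthesizer that turns the (CycleGAN-transformed) CQT back into audio is a trained neural model and is not reproduced here; the file name and parameters are our assumptions.

```python
import numpy as np
import librosa

y, sr = librosa.load("piano_input.wav", sr=16000)        # hypothetical file

# Constant-Q transform: 84 pitch bins, 12 bins per octave.
cqt = librosa.cqt(y, sr=sr, hop_length=256, n_bins=84, bins_per_octave=12)
log_cqt = np.log1p(np.abs(cqt))   # log-magnitude CQT "image" fed to CycleGAN
print(log_cqt.shape)              # (84 pitch bins, time frames)
```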

3. Style Transfer for Musical Audio Using Multiple Time-Frequency Representations¹⁸

  • Feature: Audio spectrogram
  • Evaluation Metric: Mean Squared Error (MSE), content key-invariance

In this work, the authors use three different representations of audio (the log-magnitude STFT, the Mel spectrogram, and the Constant-Q Transform (CQT) spectrogram) to achieve high-quality style transfer and texture synthesis. They use the Mel spectrogram to better capture rhythmic information and the CQT spectrogram, fed to 2D convolutional neural networks, to represent harmonic style. They try three approaches to musical style transfer; a short sketch of these representations follows below.
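
To show what these representations look like in practice, here is a hedged librosa sketch that computes the log-magnitude STFT, the Mel spectrogram, and the CQT spectrogram for one audio file. The frame sizes, Mel-band count, and file name are our assumptions.

```python
import numpy as np
import librosa

y, sr = librosa.load("content.wav", sr=22050)  # hypothetical file

# Log-magnitude STFT and Mel spectrogram: capture rhythmic / timbral structure.
stft_spec = np.log1p(np.abs(librosa.stft(y, n_fft=2048, hop_length=512)))
mel_spec = np.log1p(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=512))

# CQT spectrogram: pitch bins are log-spaced, so harmonic patterns keep their
# shape when transposed, which helps represent harmonic style.
cqt_spec = np.log1p(np.abs(librosa.cqt(y, sr=sr, hop_length=512, n_bins=84)))

print(stft_spec.shape, mel_spec.shape, cqt_spec.shape)
```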

In the first approach, the Mel spectrogram content representation is used with only a few channels; it preserves clear rhythmic and timbral shape information, but all information about the key is lost.

In the second approach, they use the convolved 2D feature map of the CQT spectrogram and apply max-pooling to maintain key invariance. This way, melodic and harmonic information is better preserved without keeping local key information.
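
One way to read this key-invariance trick is sketched below: convolving along the pitch axis of the CQT and max-pooling over pitch, so that a transposed (key-shifted) melody produces nearly the same feature map. This is our illustration, not the authors’ exact architecture.

```python
import torch
import torch.nn as nn

cqt_tensor = torch.randn(1, 1, 84, 400)       # (batch, channel, pitch bins, frames)

# A kernel spanning one octave of pitch bins and a few frames in time.
conv = nn.Conv2d(1, 64, kernel_size=(12, 3), padding=(0, 1))
features = conv(cqt_tensor)                   # (1, 64, 73, 400)

# Max over the pitch axis: shifting the melody up or down mostly just moves
# where the maximum occurs, so the pooled feature map barely changes.
key_invariant = features.max(dim=2).values
print(key_invariant.shape)                    # torch.Size([1, 64, 400])
```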

Finally, the best approach is obtained by combining the loss terms of all key-invariant representations. Below is the table of results for all approaches.

Table of Results for All Approaches

That is all for this week. Thank you for reading and we hope to see you next week!

Bob Ross Says Goodbye

References

[1] Mishra, P., Garg, R., Kumar, A., Gupta, A., & Kumar, P. (2016, September). Song year prediction using Apache Spark. In 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI) (pp. 1590–1594). IEEE.

[2] Root-mean-square deviation — Wikipedia. (2021). https://en.wikipedia.org/wiki/Root-mean-square_deviation

[3] Teixeira, M., & Rodríguez, M. M0444 Project One: Release Year Prediction for Songs.

[4] Mean absolute error — Wikipedia. (2021). https://en.wikipedia.org/wiki/Mean_absolute_error

[5] Bertin-Mahieux, T., Ellis, D. P., Whitman, B., & Lamere, P. (2011). The million song dataset.

[6] J. Langford, L. Li, and A. L. Strehl. Vowpal Wabbit (fast online learning), 2007. http://hunch.net/vw/.

[7] Grinstein, E., Duong, N. Q., Ozerov, A., & Pérez, P. (2018, April). Audio style transfer. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 586–590). IEEE.

[8] Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2414–2423).

[9] Aytar, Y., Vondrick, C., & Torralba, A. (2016). Soundnet: Learning sound representations from unlabeled video. arXiv preprint arXiv:1610.09001.

[10] McDermott, J. H., & Simoncelli, E. P. (2011). Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis. Neuron, 71(5), 926–940.

[11] Huang, S., Li, Q., Anil, C., Bao, X., Oore, S., & Grosse, R. B. (2018). Timbretron: A wavenet (cyclegan (cqt (audio))) pipeline for musical timbre transfer. arXiv preprint arXiv:1811.09620.

[12] Amazon Mechanical Turk. (2021). https://www.mturk.com/

[13] Short-time Fourier transform — Wikipedia. (2021). https://en.wikipedia.org/wiki/Short-time_Fourier_transform

[14] Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision (pp. 2223–2232)

[15] Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on acoustics, speech, and signal processing, 32(2), 236–243.

[16] Constant-Q transform — Wikipedia. (2018). https://en.wikipedia.org/wiki/Constant-Q_transform

[17] Oord, A. V. D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … & Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

[18] Barry, S., & Kim, Y. (2018). “Style” Transfer for Musical Audio Using Multiple Time-Frequency Representations.

Past Blogs
